For a machine learning task, I need to work with sequences of different lengths. To process these sequences efficiently, I process them in batches of size size_batch. A batch typically has 4 dimensions and I want to convert it to a numpy ndarray with 4 dimensions. For each sequence, I need to pad with some defined pad_value so that each element has the same size: the maximal size.
For example, with a 3-dimensional input:
[[[0, 1, 2],
  [3],
  [4, 5]],
 [[6]],
 [[7, 8],
  [9]]]
the desired output for a pad_value of -1 is:
[[[0, 1, 2],
  [3, -1, -1],
  [4, 5, -1]],
 [[6, -1, -1],
  [-1, -1, -1],
  [-1, -1, -1]],
 [[7, 8, -1],
  [9, -1, -1],
  [-1, -1, -1]]]
which has shape (3, 3, 3). For this problem, one can assume that there are no empty lists in the input. Here is the solution I came up with:
import numpy as np
import itertools as it
from typing import List


def pad(array: List, pad_value: np.int32, dtype: type = np.int32) -> np.ndarray:
    """Pads a nested list to the max shape and fills empty values with pad_value

    :param array: high-dimensional list to be padded
    :param pad_value: value used to fill the missing entries
    :param dtype: type of the output
    :return: padded copy of param array
    """
    # Get max shape
    def get_max_shape(arr, ax=0, dims=[]):
        try:
            if ax >= len(dims):
                dims.append(len(arr))
            else:
                dims[ax] = max(dims[ax], len(arr))
            for i in arr:
                get_max_shape(i, ax + 1, dims)
        except TypeError:  # On non-iterable / length-less objects (leaves)
            pass
        return dims

    dims = get_max_shape(array)

    # Pad values
    def get_item(arr, idx):
        while True:
            i, *idx = idx
            arr = arr[i]
            if not idx:
                break
        return arr

    r = np.zeros(dims, dtype=dtype) + pad_value
    for idx in it.product(*map(range, dims)):
        # idx runs through every tuple of indices that might
        # contain a value in array
        try:
            r[idx] = get_item(array, idx)
        except IndexError:
            continue
    return r
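For reference, a quick check on the example above (the binding name example is my own, not part of the original question):

example = [[[0, 1, 2], [3], [4, 5]],
           [[6]],
           [[7, 8], [9]]]

padded = pad(example, pad_value=-1)
print(padded.shape)  # (3, 3, 3), matching the desired output shown earlier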
It does not feel very pythonic, but it does the job. Is there a better way to do it that I should know about? I think I might be able to improve its speed with smart breaks in the last loop, but I haven't dug into it much yet.
4 Answers
nested methods
Why do you nest get_max_shape and the other helpers inside pad? There is no need to do this.
get_max_shape
Here you use recursion and a mutable default argument (dims=[]) that acts like a global variable. A simpler way would be a generator that recursively runs through the array and yields the level and length of each part, plus another function that aggregates these results. That way you avoid passing dims around altogether:
def get_dimensions(array, level=0):
    yield level, len(array)
    try:
        for row in array:
            yield from get_dimensions(row, level + 1)
    except TypeError:  # not an iterable
        pass
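For the nested list from the question (bound here to a name, array, which is my own choice for illustration), listing the generator output gives:

array = [[[0, 1, 2], [3], [4, 5]],
         [[6]],
         [[7, 8], [9]]]

list(get_dimensions(array))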
[(0, 3), (1, 3), (2, 3), (2, 1), (2, 2), (1, 1), (2, 1), (1, 2), (2, 2), (2, 1)]
The aggregation can be very simple using collections.defaultdict:
from collections import defaultdict

def get_max_shape(array):
    dimensions = defaultdict(int)
    for level, length in get_dimensions(array):
        dimensions[level] = max(dimensions[level], length)
    return [value for _, value in sorted(dimensions.items())]
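With the same array binding as above, this reproduces the shape used for padding:

get_max_shape(array)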
[3, 3, 3]
creating the result
Instead of r = np.zeros(dims, dtype=dtype) + pad_value you can use np.full.
You iterate over all possible indices and check whether each one is present in the original array. Iterating over the original array itself and writing each value into the result avoids that check; depending on how "full" the original array is, this can save some time. It also allows you to do this without your custom get_item method for fetching the element at a nested index:
def iterate_nested_array(array, index=()):
    try:
        for idx, row in enumerate(array):
            yield from iterate_nested_array(row, (*index, idx))
    except TypeError:  # final level
        for idx, item in enumerate(array):
            yield (*index, idx), item
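Again with the array binding from above, this yields one (index, item) pair per leaf value:

list(iterate_nested_array(array))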
[((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 0, 2), 2), ((0, 1, 0), 3), ((0, 2, 0), 4), ((0, 2, 1), 5), ((1, 0, 0), 6), ((2, 0, 0), 7), ((2, 0, 1), 8), ((2, 1, 0), 9)]
slice
An even better way, as suggested by @hpaulj, uses slices:
def iterate_nested_array(array, index=()):
    try:
        for idx, row in enumerate(array):
            yield from iterate_nested_array(row, (*index, idx))
    except TypeError:  # final level
        yield (*index, slice(len(array))), array
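With the same array, the slice variant yields one (index, row) pair per innermost list instead of one pair per element:

list(iterate_nested_array(array))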
[((0, 0, slice(None, 3, None)), [0, 1, 2]), ((0, 1, slice(None, 1, None)), [3]), ((0, 2, slice(None, 2, None)), [4, 5]), ((1, 0, slice(None, 1, None)), [6]), ((2, 0, slice(None, 2, None)), [7, 8]), ((2, 1, slice(None, 1, None)), [9])]
padding
def pad(array, fill_value):
    dimensions = get_max_shape(array)
    result = np.full(dimensions, fill_value)
    for index, value in iterate_nested_array(array):
        result[index] = value
    return result
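Putting it together on the question's example (same array binding as above, my own naming):

pad(array, -1)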
array([[[ 0,  1,  2],
        [ 3, -1, -1],
        [ 4,  5, -1]],

       [[ 6, -1, -1],
        [-1, -1, -1],
        [-1, -1, -1]],

       [[ 7,  8, -1],
        [ 9, -1, -1],
        [-1, -1, -1]]])
Comments:
- Thibault D.: I really should use yield more often! It's a nice trick to save memory and enhance efficiency. (Jun 20, 2019)
- hpaulj: You could save some time by assigning a whole slice on the last dimension: result[i, j, :len(row)] = row, but that requires 2 hard-coded levels of looping, rather than your flexible recursive iteration. (Jun 24, 2019)
- Maarten Fabré: Brilliant suggestion. You don't need a second for-loop for that, just a small adjustment to iterate_nested_array. Dynamic typing can lead to such simple code. (Jun 24, 2019)
You can use awkward arrays: the package is meant for working with non-rectangular arrays.
You can create an awkward array directly from the nested list. It then has a method to pad with None to the desired length, a method to fill None with a value and, last but not least, a method to convert it to a rectangular array.
Approximate example:

import awkward as ak

arr = ak.from_iter(...)
arr = ak.pad_none(arr, target=4)
arr = ak.fill_none(arr, value=...)
arr = ak.to_regular(arr)
If you are willing to use TensorFlow, the ragged submodule has such functionality:
import tensorflow as tf

x = [[[0, 1, 2],
      [3],
      [4, 5]],
     [[6]],
     [[7, 8],
      [9]]]

ragged = tf.ragged.constant(x)
ragged.to_tensor(default_value=-1)
The result is then
<tf.Tensor: shape=(3, 3, 3), dtype=int32, numpy=
array([[[ 0,  1,  2],
        [ 3, -1, -1],
        [ 4,  5, -1]],

       [[ 6, -1, -1],
        [-1, -1, -1],
        [-1, -1, -1]],

       [[ 7,  8, -1],
        [ 9, -1, -1],
        [-1, -1, -1]]], dtype=int32)>
and you can see that every dimension is padded to its maximum length in the input.
If your input is 2-dimensional, you can get away with a much simpler solution (although you need to know the pad length).
import numpy as np

# Test Case:
data = [
    [1],
    [1, 2],
    [1, 2, 3]
]
max_len = 3

# General Solution:
rectangle = np.zeros((len(data), max_len), dtype=int)
for i in range(len(data)):
    rectangle[i:i + 1, 0:len(data[i])] = data[i]

print(rectangle)
This outputs:
[[1 0 0]
 [1 2 0]
 [1 2 3]]
itertools.zip_longest is a handy tool for that, though not the fastest.
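A minimal sketch of that idea for the 2-D test case above (my own illustration, not from the answer): zip_longest pads the shorter rows while "transposing" them, and .T restores the original row order.

from itertools import zip_longest
import numpy as np

data = [[1], [1, 2], [1, 2, 3]]

# zip_longest fills missing positions with fillvalue while transposing;
# transposing back gives the rectangular row-major layout.
rectangle = np.array(list(zip_longest(*data, fillvalue=0))).T
print(rectangle)
# [[1 0 0]
#  [1 2 0]
#  [1 2 3]]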