For a machine learning task, I need to work with sequences of different lengths. To process these sequences efficiently, I process them in batches of size size_batch. A batch typically has 4 dimensions and I want to convert it to a numpy ndarray with 4 dimensions. For each sequence, I need to pad with some defined pad_value so that each element has the same size: the maximal size.
For example, with a 3-dimensional input:
[[[0, 1, 2],
  [3],
  [4, 5]],
 [[6]],
 [[7, 8],
  [9]]]
the desired output for a pad_value of -1 is:
[[[0, 1, 2],
  [3, -1, -1],
  [4, 5, -1]],
 [[6, -1, -1],
  [-1, -1, -1],
  [-1, -1, -1]],
 [[7, 8, -1],
  [9, -1, -1],
  [-1, -1, -1]]]
which has shape (3, 3, 3). For this problem, one can assume that there are no empty lists in the input. Here is the solution I came up with:
import numpy as np
import itertools as it
from typing import List


def pad(array: List, pad_value: np.int32, dtype: type = np.int32) -> np.ndarray:
    """Pads a nested list to the max shape and fills empty values with pad_value

    :param array: high-dimensional list to be padded
    :param pad_value: value used to fill the missing entries
    :param dtype: type of the output
    :return: padded copy of param array
    """
    # Get max shape
    def get_max_shape(arr, ax=0, dims=[]):
        try:
            if ax >= len(dims):
                dims.append(len(arr))
            else:
                dims[ax] = max(dims[ax], len(arr))
            for i in arr:
                get_max_shape(i, ax + 1, dims)
        except TypeError:  # On non-iterable / length-less objects (leaves)
            pass
        return dims

    dims = get_max_shape(array)

    # Pad values
    def get_item(arr, idx):
        while True:
            i, *idx = idx
            arr = arr[i]
            if not idx:
                break
        return arr

    r = np.zeros(dims, dtype=dtype) + pad_value
    for idx in it.product(*map(range, dims)):
        # idx runs through every tuple of indices that might
        # contain a value in array
        try:
            r[idx] = get_item(array, idx)
        except IndexError:
            continue
    return r
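For reference, a quick check on the example above (the binding name example is my own, not part of the original question):

example = [[[0, 1, 2], [3], [4, 5]],
           [[6]],
           [[7, 8], [9]]]

padded = pad(example, pad_value=-1)
print(padded.shape)  # (3, 3, 3), matching the desired output shown earlier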
It does not feel very pythonic, but it does the job. Is there a better way to do it that I should know about? I think I might be able to improve its speed with smart breaks in the last loop, but I haven't dug into it much yet.
4 Answers
nested methods
Why do you nest get_max_shape and the other helpers inside pad? There is no need to do this.
get_max_shape
Here you use recursion and a mutable default argument (dims=[]) that acts like a global variable. A simpler way would be a generator that recursively runs through the array and yields the level and length of each part, plus another function that aggregates these results. That way you avoid passing dims around altogether:
def get_dimensions(array, level=0):
    yield level, len(array)
    try:
        for row in array:
            yield from get_dimensions(row, level + 1)
    except TypeError:  # not an iterable
        pass
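For the nested list from the question (bound here to a name, array, which is my own choice for illustration), listing the generator output gives:

array = [[[0, 1, 2], [3], [4, 5]],
         [[6]],
         [[7, 8], [9]]]

list(get_dimensions(array))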
[(0, 3), (1, 3), (2, 3), (2, 1), (2, 2), (1, 1), (2, 1), (1, 2), (2, 2), (2, 1)]
The aggregation can be very simple using collections.defaultdict:
from collections import defaultdict

def get_max_shape(array):
    dimensions = defaultdict(int)
    for level, length in get_dimensions(array):
        dimensions[level] = max(dimensions[level], length)
    return [value for _, value in sorted(dimensions.items())]
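With the same array binding as above, this reproduces the shape used for padding:

get_max_shape(array)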
[3, 3, 3]
creating the result
Instead of r = np.zeros(dims, dtype=dtype) + pad_value you can use np.full.
You iterate over all possible indices and check whether each one is present in the original array. Iterating over the original array itself and writing each value into the result avoids that check; depending on how "full" the original array is, this can save some time. It also allows you to do this without your custom get_item method for fetching the element at a nested index:
def iterate_nested_array(array, index=()):
    try:
        for idx, row in enumerate(array):
            yield from iterate_nested_array(row, (*index, idx))
    except TypeError:  # final level
        for idx, item in enumerate(array):
            yield (*index, idx), item
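Again with the array binding from above, this yields one (index, item) pair per leaf value:

list(iterate_nested_array(array))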
[((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 0, 2), 2), ((0, 1, 0), 3), ((0, 2, 0), 4), ((0, 2, 1), 5), ((1, 0, 0), 6), ((2, 0, 0), 7), ((2, 0, 1), 8), ((2, 1, 0), 9)]
slice
An even better way, as suggested by @hpaulj, uses slices:
def iterate_nested_array(array, index=()):
    try:
        for idx, row in enumerate(array):
            yield from iterate_nested_array(row, (*index, idx))
    except TypeError:  # final level
        yield (*index, slice(len(array))), array
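With the same array, the slice variant yields one (index, row) pair per innermost list instead of one pair per element:

list(iterate_nested_array(array))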
[((0, 0, slice(None, 3, None)), [0, 1, 2]), ((0, 1, slice(None, 1, None)), [3]), ((0, 2, slice(None, 2, None)), [4, 5]), ((1, 0, slice(None, 1, None)), [6]), ((2, 0, slice(None, 2, None)), [7, 8]), ((2, 1, slice(None, 1, None)), [9])]
padding
def pad(array, fill_value):
    dimensions = get_max_shape(array)
    result = np.full(dimensions, fill_value)
    for index, value in iterate_nested_array(array):
        result[index] = value
    return result
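Putting it together on the question's example (same array binding as above, my own naming):

pad(array, -1)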
array([[[ 0,  1,  2],
        [ 3, -1, -1],
        [ 4,  5, -1]],

       [[ 6, -1, -1],
        [-1, -1, -1],
        [-1, -1, -1]],

       [[ 7,  8, -1],
        [ 9, -1, -1],
        [-1, -1, -1]]])
Comments:
- Thibault D.: I really should use yield more often! It's a nice trick to save memory and enhance efficiency. (Jun 20, 2019)
- hpaulj: You could save some time by assigning a whole slice on the last dimension: result[i, j, :len(row)] = row, but that requires 2 hard-coded levels of looping, rather than your flexible recursive iteration. (Jun 24, 2019)
- Maarten Fabré: Brilliant suggestion. You don't need a second for-loop for that, just a small adjustment to iterate_nested_array. Dynamic typing can lead to such simple code. (Jun 24, 2019)
You can use awkward arrays: the package is meant for working with non-rectangular arrays.
You can create an awkward array directly from the nested list. It then has a method to pad with None to the desired length, a method to fill None with a value and, last but not least, a method to convert it to a rectangular array.
Approximate example:

import awkward as ak

arr = ak.from_iter(...)
arr = ak.pad_none(arr, target=4)
arr = ak.fill_none(arr, value=...)
arr = ak.to_regular(arr)
If you are willing to use TensorFlow, the ragged submodule has such functionality:
import tensorflow as tf

x = [[[0, 1, 2],
      [3],
      [4, 5]],
     [[6]],
     [[7, 8],
      [9]]]

ragged = tf.ragged.constant(x)
ragged.to_tensor(default_value=-1)
The result is then
<tf.Tensor: shape=(3, 3, 3), dtype=int32, numpy=
array([[[ 0,  1,  2],
        [ 3, -1, -1],
        [ 4,  5, -1]],

       [[ 6, -1, -1],
        [-1, -1, -1],
        [-1, -1, -1]],

       [[ 7,  8, -1],
        [ 9, -1, -1],
        [-1, -1, -1]]], dtype=int32)>
and you can see that every dimension is padded to its maximum length in the input.
If your input is 2-dimensional, you can get away with a much simpler solution (although you need to know the pad length).
import numpy as np

# Test Case:
data = [
    [1],
    [1, 2],
    [1, 2, 3]
]
max_len = 3

# General Solution:
rectangle = np.zeros((len(data), max_len), dtype=int)
for i in range(len(data)):
    rectangle[i:i + 1, 0:len(data[i])] = data[i]

print(rectangle)
This outputs:
[[1 0 0]
 [1 2 0]
 [1 2 3]]
itertools.zip_longest is a handy tool for that, though not the fastest.
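A minimal sketch of that idea for the 2-D test case above (my own illustration, not from the answer): zip_longest pads the shorter rows while "transposing" them, and .T restores the original row order.

from itertools import zip_longest
import numpy as np

data = [[1], [1, 2], [1, 2, 3]]

# zip_longest fills missing positions with fillvalue while transposing;
# transposing back gives the rectangular row-major layout.
rectangle = np.array(list(zip_longest(*data, fillvalue=0))).T
print(rectangle)
# [[1 0 0]
#  [1 2 0]
#  [1 2 3]]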