Find the shape of the n-dim array that would be formed from a nested list of lists with variable lengths if we were to pad the lists to the same length at each nest level. E.g.
ls = [[1],
      [2, 3],
      [4, 5, 6]]
# (3, 3) because
ls_padded = [[1, 0, 0],
             [2, 3, 0],
             [4, 5, 6]]
Attempt
import itertools

def find_shape(seq):
    try:
        len_ = len(seq)
    except TypeError:
        return ()
    shapes = [find_shape(subseq) for subseq in seq]
    return (len_,) + tuple(max(sizes) for sizes in
                           itertools.zip_longest(*shapes, fillvalue=1))
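For instance, on the 2D example above, this returns the shape the padded array would have:

ls = [[1], [2, 3], [4, 5, 6]]
print(find_shape(ls))  # (3, 3)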
Problem
Too slow for large arrays. Can it be done faster? Test & bench code.
The solution shouldn't require reading list values at the final nest depth, only basic attributes (e.g. len), as that depth in the application contains 1D arrays on GPU (and accessing their values moves them back to CPU). It must work on n-dim arrays.
The exact goal is to attain such padding, but with a choice of padding from the left or from the right. The data structure is a list of lists that's to form a 5D array, and only the final nest level contains non-lists, which are 1D arrays. The first two nest levels have fixed list lengths (they can directly form an array), and the 1D arrays are of the same length, so the only uncertainty is on the 3rd and 4th dims.
2 Answers
From your question:
The data structure is a list of lists that's to form a 5D array, and only the final nest level contains non-lists, which are 1D arrays. The first two nest levels have fixed list lengths (they can directly form an array), and the 1D arrays are of the same length, so the only uncertainty is on the 3rd and 4th dims.
From your comment:
and no empties
We can make it much faster by taking advantage of that specification.
Your solution walks over everything, including the numbers at the recursion leaves. It even always catches exceptions there, and "catching an exception is expensive".
Instead, for the fixed-size dimensions just take the first length, and for the others run over all their lists and take the maximum length.
from itertools import chain

def find_shape_new(seq):
    flat = chain.from_iterable
    return (
        len(seq),
        len(seq[0]),
        max(map(len, flat(seq))),
        max(map(len, flat(flat(seq)))),
        len(seq[0][0][0][0]),
    )
(Only partially tested, as your Test & bench code doesn't adhere to your specification.)
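For illustration, here is a made-up toy instance of that specification (dims 0, 1, and 4 fixed; dims 2 and 3 ragged), with plain lists standing in for the 1D GPU arrays:

a = [0.0, 0.0, 0.0, 0.0]  # stands in for a length-4 1D GPU array
seq = [
    [[[a], [a, a]], [[a, a, a]]],
    [[[a, a]], [[a], [a], [a, a]]],
]
print(find_shape_new(seq))  # (2, 2, 3, 3, 4)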
Could also be generalized, maybe like this:

def find_shape_new(seq, num_dims=None, fixed_dims=None):
    ...
Parameters:

fixed_dims would be a set naming the fixed dimensions, like {0, 1, 4} or {1, 2, 5} for your specification above. For each fixed dimension, the function would use the len(seq[0][0][0][0]) way, and for each non-fixed dimension, it would use the max(map(len, flat(flat(seq)))) way.

num_dims would tell the dimensionality, e.g., for a 5D array it would be 5. If unknown, represented by None, you'd just keep going until you reach a number instead of a list. That would involve looking at a number, but only at one.
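A minimal sketch of that generalization, assuming num_dims is known and no lists are empty (find_shape_general is a hypothetical name, and the one-path shortcut for trailing fixed dims is my reading of the idea, not code from the answer):

from itertools import chain

def find_shape_general(seq, num_dims, fixed_dims=frozenset()):
    shape = []
    level = [seq]  # all sequences at the current nest depth
    for dim in range(num_dims):
        if dim in fixed_dims:
            shape.append(len(level[0]))         # first length suffices
        else:
            shape.append(max(map(len, level)))  # must scan the whole level
        if dim == num_dims - 1:
            break
        if all(d in fixed_dims for d in range(dim + 1, num_dims)):
            level = [level[0][0]]  # only fixed dims remain: one path is enough
        else:
            level = list(chain.from_iterable(level))
    return tuple(shape)

find_shape_general(seq, 5, {0, 1, 4}) then matches find_shape_new(seq) for the 5D spec above, without ever touching the leaf values.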
- So basically take the short route for dims 0, 1, 4 and apply a faster find_shape_old on the rest? It's what I had in mind but hoped there's better, though it may very well be that better isn't needed - will test this. – OverLordGoldDragon, Oct 15, 2021 at 21:14
- @OverLordGoldDragon That, plus not looking at dim 5. It should be a lot better. Looking forward to your results. This is visiting almost only what needs to be visited, so I don't see how you'd hope for something better. – no comment, Oct 15, 2021 at 22:24
- I omitted the simplifying parts to avoid "divide and conquer" solutions, instead focusing on the irreducible worst case. Perhaps exhaustive iteration is the only way, which shifts optimization to parallelization. Your answer includes things I didn't think of, and I like that it can easily target the fixed dims, so I'll take it. – OverLordGoldDragon, Oct 18, 2021 at 19:06
Get the length of the longest row, then pad each row with a sufficient number of zeros.
from itertools import islice, repeat, chain

def pad(x):
    zeros = repeat(0)
    n = max(map(len, x))  # length of the longest row
    return [list(islice(chain(row, zeros), n)) for row in x]
zeros is an "infinite" stream of 0s. It doesn't actually store them in memory, but the instance of repeat defines a __next__ method that always returns 0.

chain just glues two iterators together, so chain(row, zeros) will first yield the elements of row, then elements from zeros.

islice(..., n) yields the first n elements of the given iterator. How many zeros it will yield from chain(row, zeros) thus depends on how long row is in the first place.
Visually, you are constructing a sequence of infinite rows, then taking finite prefixes from each. That is:

[1, 0, 0, 0, 0, 0, ... ] -> [1, 0, 0]
[2, 3, 0, 0, 0, 0, ... ] -> [2, 3, 0]
[4, 5, 6, 0, 0, 0, ... ] -> [4, 5, 6]

The list comprehension then collects the padded rows into a single list to complete the new array.
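Applied to the 2D example from the question:

ls = [[1], [2, 3], [4, 5, 6]]
print(pad(ls))  # [[1, 0, 0], [2, 3, 0], [4, 5, 6]]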
- Isn't this only for 2D? I specified n-dim in the opening sentence and test code; maybe I should've been more explicit. – OverLordGoldDragon, Oct 14, 2021 at 21:23
- Oops, I did miss the n-dim request. – chepner, Oct 14, 2021 at 22:38
- I'll upvote but keep the question open - clarified. – OverLordGoldDragon, Oct 14, 2021 at 22:45
- (len_,) + tuple(...) allocates 2 unnecessary objects that just go straight to the garbage collector after the concatenation, but this seems like a micro-optimization given that the depth would be unlikely to be huge, I'd imagine.
- find_shape is the bottleneck, especially when run on GPU. NumPy can only create ragged arrays, and not on GPU; PyTorch is used. If you have something in mind for this goal, I can open a question.
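For what it's worth, a sketch of the allocation micro-optimization from the first comment above - building the final tuple in one pass with chain instead of concatenating (find_shape_chain is a hypothetical name; whether this is measurably faster is untested, and the commenter already doubts it matters):

from itertools import chain, zip_longest

def find_shape_chain(seq):
    # Same recursion as find_shape, but tuple(chain(...)) builds the
    # result directly, avoiding the temporary tuple(...) that the
    # concatenation (len_,) + tuple(...) immediately discards.
    try:
        len_ = len(seq)
    except TypeError:
        return ()
    shapes = [find_shape_chain(subseq) for subseq in seq]
    return tuple(chain(
        (len_,),
        (max(sizes) for sizes in zip_longest(*shapes, fillvalue=1)),
    ))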