Find the shape of the n-dim array that would be formed from a nested list of lists with variable lengths if we were to pad the lists to the same length at each nest level. E.g.
ls = [[1],
      [2, 3],
      [4, 5, 6]]
# (3, 3) because
ls_padded = [[1, 0, 0],
             [2, 3, 0],
             [4, 5, 6]]
Attempt
import itertools

def find_shape(seq):
    try:
        len_ = len(seq)
    except TypeError:
        return ()
    shapes = [find_shape(subseq) for subseq in seq]
    return (len_,) + tuple(max(sizes) for sizes in
                           itertools.zip_longest(*shapes, fillvalue=1))
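For instance, on the 2D example above, this returns the shape the padded array would have:

ls = [[1], [2, 3], [4, 5, 6]]
print(find_shape(ls))  # (3, 3)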
Problem
Too slow for large arrays. Can it be done faster? Test & bench code.
The solution shouldn't require reading list values at the final nest depth, only basic attributes (e.g. len), as that depth in the application contains 1D arrays on GPU (and accessing their values moves them back to CPU). It must work on n-dim arrays.
The exact goal is to attain such padding, but with a choice of padding from the left or from the right. The data structure is a list of lists that's to form a 5D array, and only the final nest level contains non-lists, which are 1D arrays. The first two nest levels have fixed list lengths (they can directly form an array), and the 1D arrays are of the same length, so the only uncertainty is on the 3rd and 4th dims.
2 Answers
From your question:
The data structure is a list of lists that's to form a 5D array, and only the final nest level contains non-lists, which are 1D arrays. The first two nest levels have fixed list lengths (they can directly form an array), and the 1D arrays are of the same length, so the only uncertainty is on the 3rd and 4th dims.
From your comment:
and no empties
We can make it much faster by taking advantage of that specification.
Your solution walks over everything, including the numbers at the recursion leaves. It even always catches exceptions there, and "catching an exception is expensive".
Instead, for the fixed-size dimensions just take the first length, and for the others run over all their lists and take the maximum length.
from itertools import chain

def find_shape_new(seq):
    flat = chain.from_iterable
    return (
        len(seq),
        len(seq[0]),
        max(map(len, flat(seq))),
        max(map(len, flat(flat(seq)))),
        len(seq[0][0][0][0]),
    )
(Only partially tested, as your Test & bench code doesn't adhere to your specification.)
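For illustration, here is a made-up toy instance of that specification (dims 0, 1, and 4 fixed; dims 2 and 3 ragged), with plain lists standing in for the 1D GPU arrays:

a = [0.0, 0.0, 0.0, 0.0]  # stands in for a length-4 1D GPU array
seq = [
    [[[a], [a, a]], [[a, a, a]]],
    [[[a, a]], [[a], [a], [a, a]]],
]
print(find_shape_new(seq))  # (2, 2, 3, 3, 4)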
Could also be generalized, maybe like this:

def find_shape_new(seq, num_dims=None, fixed_dims=None):
    ...
Parameters:

fixed_dims would be a set naming the fixed dimensions, like {0, 1, 4} or {1, 2, 5} for your specification above. For each fixed dimension, the function would use the len(seq[0][0][0][0]) way, and for each non-fixed dimension, it would use the max(map(len, flat(flat(seq)))) way.

num_dims would tell the dimensionality, e.g., for a 5D array it would be 5. If unknown, represented by None, you'd just keep going until you reach a number instead of a list. That would involve looking at a number, but only at one.
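A minimal sketch of that generalization, assuming num_dims is known and no lists are empty (find_shape_general is a hypothetical name, and the one-path shortcut for trailing fixed dims is my reading of the idea, not code from the answer):

from itertools import chain

def find_shape_general(seq, num_dims, fixed_dims=frozenset()):
    shape = []
    level = [seq]  # all sequences at the current nest depth
    for dim in range(num_dims):
        if dim in fixed_dims:
            shape.append(len(level[0]))         # first length suffices
        else:
            shape.append(max(map(len, level)))  # must scan the whole level
        if dim == num_dims - 1:
            break
        if all(d in fixed_dims for d in range(dim + 1, num_dims)):
            level = [level[0][0]]  # only fixed dims remain: one path is enough
        else:
            level = list(chain.from_iterable(level))
    return tuple(shape)

find_shape_general(seq, 5, {0, 1, 4}) then matches find_shape_new(seq) for the 5D spec above, without ever touching the leaf values.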
- So basically take the short route for dims 0, 1, 4 and apply a faster find_shape_old on the rest? It's what I had in mind but hoped there's better, though it may very well be that better isn't needed - will test this. – OverLordGoldDragon, Oct 15, 2021 at 21:14
- @OverLordGoldDragon That, plus not looking at dim 5. It should be a lot better. Looking forward to your results. This is visiting almost only what needs to be visited, so I don't see how you'd hope for something better. – no comment, Oct 15, 2021 at 22:24
- I omitted the simplifying parts to avoid "divide and conquer" solutions, instead focusing on the irreducible worst case. Perhaps exhaustive iteration is the only way, which shifts optimization to parallelization. Your answer includes things I didn't think of, and I like that it can easily target the fixed dims, so I'll take it. – OverLordGoldDragon, Oct 18, 2021 at 19:06
Get the length of the longest row, then pad each row with a sufficient number of zeros.
from itertools import islice, repeat, chain

def pad(x):
    zeros = repeat(0)
    n = max(map(len, x))  # length of the longest row
    return [list(islice(chain(row, zeros), n)) for row in x]
zeros is an "infinite" stream of 0s. It doesn't actually store them in memory, but the instance of repeat defines a __next__ method that always returns 0.

chain just glues two iterators together, so chain(row, zeros) will first yield the elements of row, then elements from zeros.

islice(..., n) yields the first n elements of the given iterator. How many zeros it will yield from chain(row, zeros) thus depends on how long row is in the first place.
Visually, you are constructing a sequence of infinite rows, then taking finite prefixes from each. That is:

[1, 0, 0, 0, 0, 0, ... ] -> [1, 0, 0]
[2, 3, 0, 0, 0, 0, ... ] -> [2, 3, 0]
[4, 5, 6, 0, 0, 0, ... ] -> [4, 5, 6]

The list comprehension then collects the padded rows into a single list to complete the new array.
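Applied to the 2D example from the question:

ls = [[1], [2, 3], [4, 5, 6]]
print(pad(ls))  # [[1, 0, 0], [2, 3, 0], [4, 5, 6]]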
- Isn't this only for 2D? I specified n-dim in the opening sentence and test code; maybe I should've been more explicit. – OverLordGoldDragon, Oct 14, 2021 at 21:23
- Oops, I did miss the n-dim request. – chepner, Oct 14, 2021 at 22:38
- I'll upvote but keep the question open - clarified. – OverLordGoldDragon, Oct 14, 2021 at 22:45
- (len_,) + tuple(...) allocates 2 unnecessary objects that just go straight to the garbage collector after the concatenation, but this seems like a micro-optimization given that the depth would be unlikely to be huge, I'd imagine.
- find_shape is the bottleneck, especially when run on GPU. NumPy can only create ragged arrays, and not on GPU; PyTorch is used. If you have something in mind for this goal, I can open a question.
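For what it's worth, a sketch of the allocation micro-optimization from the first comment above - building the final tuple in one pass with chain instead of concatenating (find_shape_chain is a hypothetical name; whether this is measurably faster is untested, and the commenter already doubts it matters):

from itertools import chain, zip_longest

def find_shape_chain(seq):
    # Same recursion as find_shape, but tuple(chain(...)) builds the
    # result directly, avoiding the temporary tuple(...) that the
    # concatenation (len_,) + tuple(...) immediately discards.
    try:
        len_ = len(seq)
    except TypeError:
        return ()
    shapes = [find_shape_chain(subseq) for subseq in seq]
    return tuple(chain(
        (len_,),
        (max(sizes) for sizes in zip_longest(*shapes, fillvalue=1)),
    ))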