As part of my implementation of cross-validation, I find myself needing to split a list into chunks of roughly equal size.
import random
def chunk(xs, n):
    ys = list(xs)
    random.shuffle(ys)
    ylen = len(ys)
    size = int(ylen / n)
    chunks = [ys[0+size*i : size*(i+1)] for i in range(n)]
    leftover = ylen - size*n
    edge = size*n
    for i in range(leftover):
        chunks[i%n].append(ys[edge+i])
    return chunks
This works as intended:
>>> chunk(range(10), 3)
[[4, 1, 2, 7], [5, 3, 6], [9, 8, 0]]
But it seems rather long and boring. Is there a library function that could perform this operation? Are there pythonic improvements that can be made to my code?
3 Answers
import random
def chunk(xs, n):
    ys = list(xs)
Copies of lists are usually taken using xs[:], although list(xs) also accepts any iterable.
    random.shuffle(ys)
    ylen = len(ys)
I don't think storing the length in a variable actually helps your code much.
    size = int(ylen / n)
Use size = ylen // n instead; // is the integer division operator.
    chunks = [ys[0+size*i : size*(i+1)] for i in range(n)]
Why the 0+?
    leftover = ylen - size*n
Actually, you can find size and leftover in one step using size, leftover = divmod(ylen, n).
    edge = size*n
    for i in range(leftover):
        chunks[i%n].append(ys[edge+i])
leftover is always less than n, so the i%n is unnecessary. You can pair each leftover value with its own chunk instead:
    for c, value in zip(chunks, ys[edge:]):
        c.append(value)
    return chunks
Further improvements are possible with numpy. If this is part of number-crunching code, you should look into it.
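Putting the line-by-line suggestions above together (leaving numpy aside), a minimal sketch might look like this. It assumes Python 3, so range stands in for xrange, and the loop variable c is my own naming choice to avoid shadowing the function name chunk.

```python
import random

def chunk(xs, n):
    """Shuffle xs and split it into n chunks whose sizes differ by at most one."""
    ys = list(xs)                      # copy, so the caller's data is not shuffled
    random.shuffle(ys)
    size, leftover = divmod(len(ys), n)
    edge = size * n                    # index of the first leftover item
    chunks = [ys[size * i:size * (i + 1)] for i in range(n)]
    # leftover < n, so zip pairs each leftover item with a distinct chunk
    for c, value in zip(chunks, ys[edge:]):
        c.append(value)
    return chunks

parts = chunk(range(10), 3)
print([len(p) for p in parts])         # [4, 3, 3]
```

The contents of each chunk vary from run to run because of the shuffle, but the sizes do not.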
Is there a library function that could perform this operation?
No.
Are there pythonic improvements that can be made to my code?
A few.
Sorry it seems boring, but there's not much better you can do.
The biggest change might be to make this into a generator function, which may be a tiny bit neater.
import random
def chunk(xs, n):
    ys = list(xs)
    random.shuffle(ys)
    size = len(ys) // n
    leftovers = ys[size*n:]
    for c in range(n):
        # Hand out one leftover item per chunk until none remain.
        if leftovers:
            extra = [leftovers.pop()]
        else:
            extra = []
        yield ys[c*size:(c+1)*size] + extra
The calling code changes slightly, depending on what you're doing. If you need an actual list, materialize the generator:
chunk_list = list(chunk(range(10), 3))
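Since the question mentions cross-validation, here is a hedged sketch of how the generator might be consumed for that purpose. The helper folds is hypothetical and not part of the original answer, and the code assumes Python 3 (range rather than xrange).

```python
import random

def chunk(xs, n):
    """Generator version: yield n shuffled chunks of nearly equal size."""
    ys = list(xs)
    random.shuffle(ys)
    size = len(ys) // n
    leftovers = ys[size * n:]
    for c in range(n):
        extra = [leftovers.pop()] if leftovers else []
        yield ys[c * size:(c + 1) * size] + extra

def folds(xs, n):
    """Hypothetical helper: yield (train, test) pairs for n-fold cross-validation."""
    chunks = list(chunk(xs, n))        # materialize once so every fold uses the same split
    for i, test in enumerate(chunks):
        # Training data is everything outside the held-out chunk.
        train = [x for j, c in enumerate(chunks) if j != i for x in c]
        yield train, test

for train, test in folds(range(10), 3):
    print(len(train), len(test))
```

Materializing the chunks once matters here: calling chunk again for each fold would reshuffle the data and produce overlapping test sets.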
The if statement can also be removed, since the function is really two loops in sequence: one for the chunks that get an extra item and one for those that don't. But that's being really fussy about performance.
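import random
def chunk(xs, n):
    ys = list(xs)
    random.shuffle(ys)
    size = len(ys) // n
    leftovers = ys[size*n:]
    # The first len(leftovers) chunks each take one leftover item.
    for c, xtra in enumerate(leftovers):
        yield ys[c*size:(c+1)*size] + [xtra]
    # The remaining chunks get no extra item. Starting from len(leftovers)
    # also works when there are no leftovers and the first loop never runs.
    for c in range(len(leftovers), n):
        yield ys[c*size:(c+1)*size]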
Make it a generator. You could then simplify the logic.
import random
def chunk(xs, n):
    ys = list(xs)
    random.shuffle(ys)
    chunk_length = len(ys) // n
    needs_extra = len(ys) % n
    start = 0
    for i in range(n):
        # The first needs_extra chunks are one item longer than the rest.
        if i < needs_extra:
            end = start + chunk_length + 1
        else:
            end = start + chunk_length
        yield ys[start:end]
        start = end
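The arithmetic behind this loop can be checked in isolation, without the shuffle. The sketch below uses a hypothetical helper, chunk_sizes, which is not part of the answer; it only reproduces the rule that the first len(ys) % n chunks are one item longer.

```python
def chunk_sizes(total, n):
    """Hypothetical helper: sizes of n nearly equal chunks of `total` items."""
    base, extra = divmod(total, n)     # same values as chunk_length and needs_extra
    return [base + 1 if i < extra else base for i in range(n)]

print(chunk_sizes(10, 3))  # [4, 3, 3]
print(chunk_sizes(9, 3))   # [3, 3, 3]
```

The sizes always sum to the total, and no two chunks differ in length by more than one.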