Save list over many csv files each with given number of lines

Question 1

I wrote a little code for outputting to csv. It takes a list object named outtext and saves it across multiple .csv files. Each csv file contains cutoff elements of the list (except possibly the last), where cutoff is a specified number of elements/lines. This is useful when the user has to avoid writing files that are too large (e.g. GitHub restricts file sizes to 100MB). The filenames are numbered from 0 to n - 1, where n is the length of the output object divided by cutoff, rounded up.

However, the code looks quite clunky and is quite long, given that it performs a relatively simple task:

import csv
import math
# dummy content
outtext = mylist = [None] * 300000
# Parameters specified by user
output_file = "path/name.csv"
cutoff = 150000
output_file_tokens = output_file.rsplit('.', 1)
num_files = int(math.ceil(len(outtext)/float(cutoff)))
for filenumber in range(num_files):
    counter = 0
    output_file = output_file_tokens[0] + str(filenumber) + "." + output_file_tokens[1]
    while counter <= cutoff:
        with open(output_file, 'wb') as f:
            writer = csv.writer(f)
            for line in outtext[:cutoff]:
                writer.writerow(line)
                counter += 1
    del outtext[:cutoff]
    print ">>> " + output_file + " successfully saved"

Is there room for improvement?

Comment

Where does outtext come from? It's not defined in this script.

Comment

outtext is just the text to write out. I thought it wasn't necessary for understanding the code - sorry.

Answer

I don't know where outtext comes from, but this seems overly complicated. There's too much nesting, and there's not enough separation of concerns.

Let's back up. We want to (a) chunk our list into sizes of cutoff and (b) write each such chunk into a new file. For each chunk, we create a new file, and write all of the rows. Each of those problems is bite-size and manageable.

First, from "How do you split a list into evenly sized chunks?":

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]
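
To see what the generator produces, here is a small sketch. It assumes Python 3, so it uses range() where the version above uses xrange(); the sample list is made up for illustration. Note that the last chunk is simply shorter when the length is not a multiple of n.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    # range() is lazy in Python 3; xrange() only exists in Python 2
    for i in range(0, len(l), n):
        yield l[i:i + n]


# Seven items in chunks of three: the final chunk holds the remainder.
result = list(chunks([1, 2, 3, 4, 5, 6, 7], 3))
print(result)  # [[1, 2, 3], [4, 5, 6], [7]]
```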

That gives us the first half of the problem:

for chunk in chunks(outtext, cutoff):
    # write chunk

Now for the second half, we just need an index. But rather than keeping our own counter, let's use enumerate(). Additionally, note that csvwriter has a writerows() method - so we don't need a loop for that.
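
As a quick illustration of those two pieces in isolation, the sketch below assumes Python 3 and writes to an in-memory io.StringIO buffer instead of a real file. The sample rows are invented for the example.

```python
import csv
import io

rows = [["a", 1], ["b", 2], ["c", 3]]

# enumerate() pairs each item with a running index, so no manual counter.
indexed = list(enumerate(rows))

# writerows() writes every row in one call; no explicit loop is needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
output = buffer.getvalue()
```

The csv writer terminates each row with "\r\n" by default, so output here holds three such lines.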

Putting it all together:

for index, chunk in enumerate(chunks(outtext, cutoff)):
    output_file = '{}{}.{}'.format(output_file_tokens[0], index, output_file_tokens[1])
    with open(output_file, 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(chunk)
    print ">>> " + output_file + " successfully saved"
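
If you are on Python 3, the same idea might look like the sketch below. This is one possible port, not the only way to do it. The function name write_chunked_csv is hypothetical, and I have swapped rsplit('.', 1) for os.path.splitext, which also copes with a file name that has no extension. In Python 3 the csv module expects a text-mode file opened with newline='' rather than 'wb'.

```python
import csv
import os


def chunks(rows, n):
    """Yield successive n-sized chunks from rows."""
    for i in range(0, len(rows), n):
        yield rows[i:i + n]


def write_chunked_csv(rows, output_file, cutoff):
    """Write rows to numbered CSV files holding at most cutoff rows each.

    Hypothetical helper; returns the list of file names it wrote.
    """
    # "path/name.csv" -> ("path/name", ".csv")
    base, ext = os.path.splitext(output_file)
    written = []
    for index, chunk in enumerate(chunks(rows, cutoff)):
        name = '{}{}{}'.format(base, index, ext)
        # Python 3: open in text mode with newline='' for the csv module
        with open(name, 'w', newline='') as f:
            csv.writer(f).writerows(chunk)
        written.append(name)
    return written
```

Calling write_chunked_csv(outtext, "path/name.csv", 150000) would then produce path/name0.csv, path/name1.csv, and so on, provided every element of outtext is itself a row (a list or tuple of fields).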

You don't need to del anything. That's going to be pretty slow too - every time we delete a chunk from the front of the list, all the remaining elements have to be shifted down in memory. By slicing with an index instead, we don't have to do that.
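
To make the difference concrete, here is a small side-by-side sketch, assuming Python 3. Both function names are made up for this comparison. They return the same chunks, but the del-based one empties the input list as it goes, while the slicing one leaves it untouched.

```python
def chunk_by_deleting(items, n):
    """Take n items off the front repeatedly (mutates items)."""
    out = []
    while items:
        out.append(items[:n])
        del items[:n]  # every remaining element is moved toward the front
    return out


def chunk_by_slicing(items, n):
    """Read n-sized slices by index (items is left as it was)."""
    return [items[i:i + n] for i in range(0, len(items), n)]


data = list(range(10))
by_slicing = chunk_by_slicing(data, 4)
remaining_after_slicing = len(data)    # the list is unchanged

by_deleting = chunk_by_deleting(data, 4)
remaining_after_deleting = len(data)   # the list has been consumed
```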

Comment

I didn't like the counter either and knew about writerows(), but I didn't know how to do it without a counter. Thanks for the great explanation - I learned something!

Barry · Accepted Answer · 2015-09-29 17:36:57Z
