I have Python code that splits a given large CSV into smaller CSVs. The large CSV has an ID column (column 1), which consecutive entries in the file can share. The large CSV might look something like this:
sfsddf8sdf8, 123, -234, dfsdfe, fsefsddfe
sfsddf8sdf8, 754, 464, sdfgdg, QFdgdfgdr
sfsddf8sdf8, 485, 469, mgyhjd, brgfgrdfg
sfsddf8sdf8, 274, -234, dnthfh, jyfhghfth
sfsddf8sdf8, 954, -145, lihgyb, fthgfhthj
powedfnsk93, 257, -139, sdfsfs, sdfsdfsdf
powedfnsk93, 284, -126, sdgdgr, sdagssdff
powedfnsk93, 257, -139, srfgfr, sdffffsss
erfsfeeeeef, 978, 677, dfgdrg, ssdttnmmm
etc...
The IDs are not sorted alphabetically in the input file, but all rows with the same ID are adjacent.
My code never splits a single ID across output CSVs, so each ID appears in only one output CSV.
My code is:
import pandas as pd
import os

def iterateIDs(file):  # create chunks based on tripID
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    id = first_chunk.iloc[0, 0]
    chunk = pd.DataFrame(first_chunk)
    for l in csv_reader:
        if id == l.iloc[0, 0] or len(chunk) < 1000000:  # keep adding to chunk if less than 1,000,000, or in middle of trip
            id = l.iloc[0, 0]
            chunk = chunk.append(l)
            continue
        id = l.iloc[0, 0]
        yield chunk
        chunk = pd.DataFrame(l)
    yield chunk
waypoint_filesize = os.stat('TripRecordsReportWaypoints.csv').st_size  # checks filesize
if waypoint_filesize > 100000000:  # if file too big, split into separate chunks
    chunk_count = 1
    chunk_Iterate = iterateIDs("TripRecordsReportWaypoints.csv")
    for chunk in chunk_Iterate:
        chunk.to_csv('SmallWaypoints_{}.csv'.format(chunk_count), header=None, index=None)
        chunk_count = chunk_count + 1
However, this code runs very slowly. I tested it on a small file (284 MB, 3.5 million rows), and it took over an hour to run. Is there any way I can achieve this result more quickly? I don't mind if the solution is outside of Python.
2 Answers
If I understand correctly, you want to split a file into smaller files, based on size (no more than 1,000,000 lines per file) and ID (no ID should be split among files).
If that's the case, I think you're over-complicating things. You don't need pandas and you definitely don't need to keep all the data in memory.
You just need two counters: one for the number of lines you've written and one for the index of the file to write.
Sample code (of course, replace the file names with what you need, or move the check after the write to start from 0 instead of 1):
current_id = ''
index = 0
written_lines = 0
max_lines = 1000000

with open('data.csv', 'r') as input_file:
    for line in input_file:
        values = line.split(',')
        if (current_id != values[0]) or (written_lines > max_lines):
            index += 1
            current_id = values[0]
        with open('output_{:08d}.csv'.format(index), 'a') as output_file:
            output_file.write(line)
        written_lines += 1
EDIT: This works assuming the file is sorted or that at least the IDs are grouped together as you said in the comment.
- I think that 'or' should be an 'and', otherwise it creates a csv for every id. – Joshua Kidd, Mar 29, 2017 at 11:54
- Shouldn't written_lines be updated somewhere? – bli, Mar 29, 2017 at 13:25
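Taken together, those two comments suggest changing the or to an and and resetting written_lines whenever a new file is started. A hedged sketch of how the fixes could be combined (not the answer's original code; data.csv and the output file names are the same placeholders as above):
current_id = ''
index = 1
written_lines = 0
max_lines = 1000000

with open('data.csv', 'r') as input_file:
    for line in input_file:
        values = line.split(',')
        # Start a new file only at an ID boundary, once the current file is full
        if current_id != values[0] and written_lines >= max_lines:
            index += 1
            written_lines = 0
        current_id = values[0]
        # 'a' appends, so remove any old output files before re-running
        with open('output_{:08d}.csv'.format(index), 'a') as output_file:
            output_file.write(line)
        written_lines += 1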
I tested the following with a smaller value for max_lines and a small test file. It seems to work correctly (more than one ID can be grouped in the same file) and is slightly faster than ChatterOne's proposal. I tried to avoid opening a file for each line to be written, hoping this makes the code fast enough. However, the buffering could lead to memory problems with large values of max_lines:
#!/usr/bin/env python3

# More lines can actually be written
# if a given id has more lines than this
max_lines = 100000000


def group_by_id(file):
    """This generator assumes that file has at least one line.
    It yields bunches of lines having the same first field."""
    lines = [file.readline()]
    last_id = lines[-1].split(",")[0]
    for line in file:
        id = line.split(",")[0]
        if id == last_id:
            lines.append(line)
        else:
            yield lines, len(lines)
            last_id = id
            lines = [line]
    yield lines, len(lines)


def main():
    with open("data.csv") as input_file:
        chunk_id = 0
        nb_buffered = 0
        line_buffer = []
        for lines, nb_lines in group_by_id(input_file):
            if nb_buffered + nb_lines > max_lines:
                # We need to write the current bunch of lines in a file
                chunk_id += 1
                with open("output_%d.csv" % chunk_id, "w") as output_file:
                    output_file.write("".join(line_buffer))
                # Reset the bunch of lines to be written
                line_buffer = lines
                nb_buffered = nb_lines
            else:
                # Update the bunch of lines to be written
                line_buffer.extend(lines)
                nb_buffered += nb_lines
        # Deal with the last bunch of lines
        chunk_id += 1
        with open("output_%d.csv" % chunk_id, "w") as output_file:
            output_file.write("".join(line_buffer))


if __name__ == "__main__":
    main()
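If the buffering is a concern, an alternative (a rough sketch under the same assumption that identical IDs are adjacent; "data.csv" and the output names are again placeholders) is to keep a single output file handle open and switch files only at an ID boundary, so nothing beyond the current line needs to be held in memory:
#!/usr/bin/env python3
# Rough sketch: stream each line straight to the current output file and
# switch files only at an ID boundary once the soft line limit is reached.
max_lines = 1000000  # soft cap: a file may exceed it to finish an ID group


def split_streaming(input_path="data.csv"):
    chunk_id = 1
    written = 0
    last_id = None
    output_file = open("output_%d.csv" % chunk_id, "w")
    try:
        with open(input_path) as input_file:
            for line in input_file:
                current_id = line.split(",", 1)[0]
                if current_id != last_id and written >= max_lines:
                    output_file.close()
                    chunk_id += 1
                    written = 0
                    output_file = open("output_%d.csv" % chunk_id, "w")
                output_file.write(line)
                written += 1
                last_id = current_id
    finally:
        output_file.close()


if __name__ == "__main__":
    split_streaming()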
- You don't need the exit(0) at the end, the Python interpreter will exit automatically. I would also put the whole code in an if __name__ == "__main__": block to allow importing the function in another script. – Graipher, Mar 29, 2017 at 16:02
- @Graipher I had read somewhere that it was best practice to explicitly exit with 0 when everything seemed to have occurred OK. I'll add the __main__ stuff. – bli, Mar 30, 2017 at 8:00
- When the Python interpreter exits, it does so with status 0, unless you explicitly use exit(n) with n > 0. Also, exit is for the interactive interpreter; use sys.exit in a script if you need it. stackoverflow.com/questions/6501121/… – Graipher, Mar 30, 2017 at 8:04
- @bli, can you update your code to add the header in each file? – 5a01d01P, Feb 28, 2020 at 6:55