I have Python code that splits a given large CSV into smaller CSVs. The large CSV has an ID column (column 1), which consecutive entries in the file can share. The large CSV might look something like this:
sfsddf8sdf8, 123, -234, dfsdfe, fsefsddfe
sfsddf8sdf8, 754, 464, sdfgdg, QFdgdfgdr
sfsddf8sdf8, 485, 469, mgyhjd, brgfgrdfg
sfsddf8sdf8, 274, -234, dnthfh, jyfhghfth
sfsddf8sdf8, 954, -145, lihgyb, fthgfhthj
powedfnsk93, 257, -139, sdfsfs, sdfsdfsdf
powedfnsk93, 284, -126, sdgdgr, sdagssdff
powedfnsk93, 257, -139, srfgfr, sdffffsss
erfsfeeeeef, 978, 677, dfgdrg, ssdttnmmm
etc...
The IDs are not sorted alphabetically in the input file, but all rows with the same ID are adjacent.
My code never splits a single ID across output CSVs, so each ID appears in only one output CSV.
My code is:
import pandas as pd
import os

def iterateIDs(file):  # create chunks based on tripID
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    id = first_chunk.iloc[0, 0]
    chunk = pd.DataFrame(first_chunk)
    for l in csv_reader:
        if id == l.iloc[0, 0] or len(chunk) < 1000000:  # keep adding to chunk if less than 1,000,000, or in middle of trip
            id = l.iloc[0, 0]
            chunk = chunk.append(l)
            continue
        id = l.iloc[0, 0]
        yield chunk
        chunk = pd.DataFrame(l)
    yield chunk
waypoint_filesize = os.stat('TripRecordsReportWaypoints.csv').st_size  # checks filesize
if waypoint_filesize > 100000000:  # if file too big, split into separate chunks
    chunk_count = 1
    chunk_Iterate = iterateIDs("TripRecordsReportWaypoints.csv")
    for chunk in chunk_Iterate:
        chunk.to_csv('SmallWaypoints_{}.csv'.format(chunk_count), header=None, index=None)
        chunk_count = chunk_count + 1
However, this code runs very slowly. I tested it on a small file (284 MB, 3.5 million rows), and it took over an hour to run. Is there any way I can achieve this result more quickly? I don't mind if the solution is outside of Python.
2 Answers
If I understand correctly, you want to split a file into smaller files, based on size (no more than 1,000,000 lines per file) and ID (no ID should be split among files).
If that's the case, I think you're over-complicating things. You don't need pandas and you definitely don't need to keep all the data in memory.
You just need two counters: one for the number of lines you've written and one for the index of the file to write.
Sample code (of course, replace the file names with what you need, or move the check after the write to start from 0 instead of 1):
current_id = ''
index = 0
written_lines = 0
max_lines = 1000000

with open('data.csv', 'r') as input_file:
    for line in input_file:
        values = line.split(',')
        if (current_id != values[0]) or (written_lines > max_lines):
            index += 1
            current_id = values[0]
        with open('output_{:08d}.csv'.format(index), 'a') as output_file:
            output_file.write(line)
        written_lines += 1
EDIT: This works assuming the file is sorted or that at least the IDs are grouped together as you said in the comment.
- I think that 'or' should be an 'and', otherwise it creates a csv for every id. – Joshua Kidd, Mar 29, 2017 at 11:54
- Shouldn't written_lines be updated somewhere? – bli, Mar 29, 2017 at 13:25
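Taken together, those two comments suggest changing the or to an and and resetting written_lines whenever a new file is started. A hedged sketch of how the fixes could be combined (not the answer's original code; data.csv and the output file names are the same placeholders as above):
current_id = ''
index = 1
written_lines = 0
max_lines = 1000000

with open('data.csv', 'r') as input_file:
    for line in input_file:
        values = line.split(',')
        # Start a new file only at an ID boundary, once the current file is full
        if current_id != values[0] and written_lines >= max_lines:
            index += 1
            written_lines = 0
        current_id = values[0]
        # 'a' appends, so remove any old output files before re-running
        with open('output_{:08d}.csv'.format(index), 'a') as output_file:
            output_file.write(line)
        written_lines += 1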
I tested the following with a smaller value for max_lines and a small test file. It seems to work correctly (more than one ID can be grouped in the same file) and is slightly faster than ChatterOne's proposal. I tried to avoid opening a file for each line to be written, hoping this makes the code fast enough. However, the buffering could lead to memory problems with large values of max_lines:
#!/usr/bin/env python3

# More lines can actually be written
# if a given id has more lines than this
max_lines = 100000000


def group_by_id(file):
    """This generator assumes that file has at least one line.
    It yields bunches of lines having the same first field."""
    lines = [file.readline()]
    last_id = lines[-1].split(",")[0]
    for line in file:
        id = line.split(",")[0]
        if id == last_id:
            lines.append(line)
        else:
            yield lines, len(lines)
            last_id = id
            lines = [line]
    yield lines, len(lines)


def main():
    with open("data.csv") as input_file:
        chunk_id = 0
        nb_buffered = 0
        line_buffer = []
        for lines, nb_lines in group_by_id(input_file):
            if nb_buffered + nb_lines > max_lines:
                # We need to write the current bunch of lines in a file
                chunk_id += 1
                with open("output_%d.csv" % chunk_id, "w") as output_file:
                    output_file.write("".join(line_buffer))
                # Reset the bunch of lines to be written
                line_buffer = lines
                nb_buffered = nb_lines
            else:
                # Update the bunch of lines to be written
                line_buffer.extend(lines)
                nb_buffered += nb_lines
        # Deal with the last bunch of lines
        chunk_id += 1
        with open("output_%d.csv" % chunk_id, "w") as output_file:
            output_file.write("".join(line_buffer))


if __name__ == "__main__":
    main()
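If the buffering is a concern, an alternative (a rough sketch under the same assumption that identical IDs are adjacent; "data.csv" and the output names are again placeholders) is to keep a single output file handle open and switch files only at an ID boundary, so nothing beyond the current line needs to be held in memory:
#!/usr/bin/env python3
# Rough sketch: stream each line straight to the current output file and
# switch files only at an ID boundary once the soft line limit is reached.
max_lines = 1000000  # soft cap: a file may exceed it to finish an ID group


def split_streaming(input_path="data.csv"):
    chunk_id = 1
    written = 0
    last_id = None
    output_file = open("output_%d.csv" % chunk_id, "w")
    try:
        with open(input_path) as input_file:
            for line in input_file:
                current_id = line.split(",", 1)[0]
                if current_id != last_id and written >= max_lines:
                    output_file.close()
                    chunk_id += 1
                    written = 0
                    output_file = open("output_%d.csv" % chunk_id, "w")
                output_file.write(line)
                written += 1
                last_id = current_id
    finally:
        output_file.close()


if __name__ == "__main__":
    split_streaming()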
- You don't need the exit(0) at the end, the Python interpreter will exit automatically. I would also put the whole code in an if __name__ == "__main__": block to allow importing the function in another script. – Graipher, Mar 29, 2017 at 16:02
- @Graipher I had read somewhere that it was best practice to explicitly exit with 0 when everything seemed to have occurred OK. I'll add the __main__ stuff. – bli, Mar 30, 2017 at 8:00
- When the Python interpreter exits, it does so with status 0, unless you explicitly use exit(n) with n > 0. Also, exit is for the interactive interpreter; use sys.exit in a script if you need it. stackoverflow.com/questions/6501121/… – Graipher, Mar 30, 2017 at 8:04
- @bli, can you update your code to add the header in each file? – 5a01d01P, Feb 28, 2020 at 6:55