I wrote a script to reorder the columns in a CSV file in descending order and then write to another CSV file. My script needs to be able to handle several tens of millions of records, and I would like it to be as performant as possible.
This is basically a mockup of a more complex CSV transformation I would be working on for work (I don't yet know the nature of the transformation I would be performing). To clarify, I was directed to write this at work, and it would be tested/scrutinised to see if it's performant enough, but it's also not the final script we would eventually run.
CSV-Transform.py
import pandas as pd
import csv

chunksize = 10 ** 6  # Or whatever value the memory permits
source_file = ""  # Change to desired source file
destination_file = ""  # Change to desired destination file


def process(chunk, headers, dest):
    df = pd.DataFrame(chunk, columns=headers)
    df.to_csv(dest, header=False, index=False)


def transform_csv(source_file, destination_file):
    with open(source_file) as infile:
        reader = csv.DictReader(infile)
        new_headers = reader.fieldnames[::-1]
    with open(destination_file, "w+") as outfile:
        outfile.write(",".join(new_headers))
        outfile.write("\n")
    with open(destination_file, 'a') as outfile:
        for chunk in pd.read_csv(source_file, chunksize=chunksize):
            process(chunk, new_headers, outfile)


transform_csv(source_file, destination_file)
1 Answer
"reorder the columns in a CSV file in descending order"

Well... not exactly. You put the columns in reverse order. Descending order implies some sorting, which you aren't doing.
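To illustrate the distinction, a minimal sketch with a hypothetical DataFrame df (not from the question):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['b', 'c', 'a'])
reversed_cols = df.iloc[:, ::-1]                   # column order: a, c, b
descending = df[sorted(df.columns, reverse=True)]  # column order: c, b, a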
"I would like it to be as performant as possible."

This would probably call for parallelism, multiprocessing or otherwise; a rough sketch of one approach follows.
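A minimal sketch of chunk-level multiprocessing, assuming chunks can be transformed independently and that pickling DataFrames to worker processes is acceptable overhead. transform_csv_parallel, reverse_chunk, and the path arguments are illustrative names, not part of the original script:

import multiprocessing as mp

import pandas as pd


def reverse_chunk(chunk: pd.DataFrame) -> str:
    # Reverse the column order and serialise this chunk without a header.
    return chunk.iloc[:, ::-1].to_csv(index=False, header=False)


def transform_csv_parallel(source: str, dest: str, chunk_lines: int = 10**6) -> None:
    # On spawn-based platforms, call this under an if __name__ == '__main__' guard.
    with pd.read_csv(source, chunksize=chunk_lines) as chunks, \
            open(dest, 'w', newline='') as outfile, \
            mp.Pool() as pool:
        # The first chunk carries the (reversed) header row.
        outfile.write(next(chunks).iloc[:, ::-1].to_csv(index=False))
        # imap yields results in submission order, so row order is preserved.
        for text in pool.imap(reverse_chunk, chunks):
            outfile.write(text)

Whether this beats the single-process version depends on how much of the time is spent parsing versus writing; profiling on a representative file would settle it.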
- Write 1e6 instead of 10**6.
- source_file and destination_file probably shouldn't be globals.
- process has a confused job. If it's really to do the processing, it should be doing the [::-1]; as it is, it's not particularly processing, but writing the data to disc.
- transform_csv, to be more flexible, could accept file objects instead of filenames (the usage sketch after the suggested code shows this).
- Don't use the csv module in this context; stick with Pandas (a header-only read is sketched after this list).
- Don't open outfile twice; only open it once.
- Write unit tests.
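As a small illustration of dropping the csv module: if you ever need the header row on its own, pandas can read just that. A sketch, with 'input.csv' as a placeholder path:

import pandas as pd

# nrows=0 parses only the header row; no data rows are read.
new_headers = pd.read_csv('input.csv', nrows=0).columns[::-1]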
Suggested
import io
import typing

import pandas as pd


def process(chunk: pd.DataFrame) -> pd.DataFrame:
    return chunk.iloc[:, ::-1]


def transform_csv(
    source_file: typing.TextIO,
    dest_file: typing.TextIO,
    chunk_lines: int = 1e6,
) -> None:
    with pd.read_csv(source_file, chunksize=chunk_lines) as chunks:
        # Headers for first chunk only
        process(next(chunks)).to_csv(dest_file, index=False)
        for chunk in chunks:
            process(chunk).to_csv(dest_file, index=False, header=False)


def test() -> None:
    with io.StringIO('''a,b,c
1,2,3
4,5,6
7,8,9
10,11,12
''', newline='\n') as source, io.StringIO() as dest:
        transform_csv(source, dest, chunk_lines=2)
        actual = dest.getvalue()
        assert actual.replace('\r\n', '\n') == '''c,b,a
3,2,1
6,5,4
9,8,7
12,11,10
'''


if __name__ == '__main__':
    test()
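For completeness, a hypothetical driver showing transform_csv called with real file handles rather than StringIO; the paths are placeholders:

# Placeholder paths; newline='' lets pandas control the line endings.
with open('input.csv', newline='') as src, \
        open('output.csv', 'w', newline='') as dst:
    transform_csv(src, dst)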
"mockup of a more complex CSV transformation I would be working on" does not comply with the site's requirement of actual code from a project rather than ... hypothetical code.