\$\begingroup\$

I wrote a script to reorder the columns in a CSV file in descending order and then write to another CSV file. My script needs to be able to handle several tens of millions of records, and I would like it to be as performant as possible.

This is basically a mockup of a more complex CSV transformation I would be working on at work (I don't yet know the nature of the transformation I would be performing). To clarify, I was directed to write this at work, and it would be tested/scrutinised to see if it's performant enough, but it's also not the final script we would eventually run.

CSV-Transform.py

import pandas as pd
import csv

chunksize = 10 ** 6  # Or whatever value the memory permits
source_file = ""
# Change to desired source file
destination_file = ""
# Change to desired destination file


def process(chunk, headers, dest):
    df = pd.DataFrame(chunk, columns=headers)
    df.to_csv(dest, header=False, index=False)


def transform_csv(source_file, destination_file):
    with open(source_file) as infile:
        reader = csv.DictReader(infile)
        new_headers = reader.fieldnames[::-1]

    with open(destination_file, "w+") as outfile:
        outfile.write(",".join(new_headers))
        outfile.write("\n")
    with open(destination_file, 'a') as outfile:
        for chunk in pd.read_csv(source_file, chunksize=chunksize):
            process(chunk, new_headers, outfile)


transform_csv(source_file, destination_file)
Peilonrayz
asked Aug 31, 2020 at 9:52
\$\endgroup\$
  • Welcome to CodeReview@SE. "mockup of a more complex CSV transformation I would be working on" does not comply with "actual code from a project rather than ... hypothetical code". Commented Aug 31, 2020 at 10:19
  • I'm voting to close this question because a mockup of would-be code is not the concrete code from a project, with enough code and/or context, that is required here. Commented Aug 31, 2020 at 10:22
  • @greybeard: to clarify, I was directed by my boss to write this mockup first ahead of the task. So it's not just hypothetical code. It is actual code I wrote at work. The script would also be scrutinised to see if it's performant enough. We just don't yet have the schema from our client. Commented Aug 31, 2020 at 10:24
  • I have added the elaboration. I don't actually know if this script would be tested against 80 million records, but when my boss mentioned the task of transforming a CSV file of 80 million records, I pitched the approach of using a Pandas dataframe and he asked me to write this mockup first so he could check my approach and see if it would scale. Commented Aug 31, 2020 at 10:32
  • If your boss ordered you to make the mock-up, it's still a mock-up. Is it at least a functional mock-up? As in, is it actually more of a prototype, perhaps? "and it would be tested/scrutinised to see if it's performant enough": is this something that will eventually be done, or has it already been done? As in, has this already been tested? Commented Aug 31, 2020 at 14:40

1 Answer

2
\$\begingroup\$

"reorder the columns in a CSV file in descending order"

Well... not exactly. You put the columns in reverse order. Descending order implies some sorting, which you aren't doing.
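To make the distinction concrete, here is a minimal sketch using a plain list of hypothetical header names (they are not from the original post). Reversing keeps the existing order and flips it; descending order sorts by value first.

```python
# Hypothetical header names, chosen only for illustration
headers = ["id", "name", "age"]

# What the script does: flip the existing order
reversed_headers = headers[::-1]

# What "descending order" would imply: sort by name, largest first
descending_headers = sorted(headers, reverse=True)

print(reversed_headers)    # ['age', 'name', 'id']
print(descending_headers)  # ['name', 'id', 'age']
```

The two only coincide when the headers happen to be in ascending order already.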

"I would like it to be as performant as possible."

This would probably call for parallelism, multiprocessing or otherwise. I don't demonstrate that below.

Write 1e6 instead of 10**6, bearing in mind that 1e6 is a float; use int(1e6) or 1_000_000 where an int is required.

source_file and destination_file probably shouldn't be globals.

process has a confused job. If it's really meant to do the processing, it should be the one applying the [::-1]; as written, it does no real processing and only writes the data to disk.

transform_csv, to be more flexible, could accept file objects instead of filenames.
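As a rough illustration of why file objects are more flexible, the sketch below uses a hypothetical helper, reverse_columns, which is not part of the original script and reads the whole input at once purely for brevity. The same function then works with in-memory streams (handy for tests) and with real files opened by the caller.

```python
import io
import tempfile
import typing
from pathlib import Path

import pandas as pd


def reverse_columns(source: typing.TextIO, dest: typing.TextIO) -> None:
    # Hypothetical helper: no chunking here, to keep the sketch short
    pd.read_csv(source).iloc[:, ::-1].to_csv(dest, index=False)


# In-memory streams: no files on disk needed
with io.StringIO("a,b\n1,2\n") as source, io.StringIO() as dest:
    reverse_columns(source, dest)
    in_memory = dest.getvalue()

# Real files: the caller decides how they are opened
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "in.csv"
    dst_path = Path(tmp) / "out.csv"
    src_path.write_text("a,b\n1,2\n")
    with src_path.open(newline="") as source, dst_path.open("w", newline="") as dest:
        reverse_columns(source, dest)
    on_disk = dst_path.read_text()

print(in_memory.splitlines())
print(on_disk.splitlines())
```

Opening the files with newline="" leaves line-ending handling to Pandas rather than to the text layer.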

Don't use the csv module in this context; stick with Pandas.

Don't open outfile twice; only open it once.
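The file-handling point on its own can be sketched as follows. The data and the temporary path are made up, and the rows are joined by hand only to keep the focus on the single open call; that naive joining does no quoting and is not a recommendation over Pandas.

```python
import tempfile
from pathlib import Path

# Made-up data, for illustration only
header = ["c", "b", "a"]
rows = [["3", "2", "1"], ["6", "5", "4"]]

with tempfile.TemporaryDirectory() as tmp:
    dest_path = Path(tmp) / "out.csv"
    # One handle, opened once in write mode, serves the header and the body
    with dest_path.open("w", newline="") as outfile:
        outfile.write(",".join(header) + "\n")
        for row in rows:
            outfile.write(",".join(row) + "\n")
    written = dest_path.read_text()

print(written)
```

There is no need for a separate "w+" open followed by an "a" open: the handle keeps its position, so later writes land after the header.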

Write unit tests.

Suggested

import io
import typing

import pandas as pd


def process(chunk: pd.DataFrame) -> pd.DataFrame:
    # Reverse the column order; rows are left untouched
    return chunk.iloc[:, ::-1]


def transform_csv(
    source_file: typing.TextIO,
    dest_file: typing.TextIO,
    chunk_lines: int = 1_000_000,  # int literal; a bare 1e6 would be a float
) -> None:
    with pd.read_csv(source_file, chunksize=chunk_lines) as chunks:
        # Headers for first chunk only
        process(next(chunks)).to_csv(dest_file, index=False)
        for chunk in chunks:
            process(chunk).to_csv(dest_file, index=False, header=False)


def test() -> None:
    with io.StringIO('''a,b,c
1,2,3
4,5,6
7,8,9
10,11,12
''', newline='\n') as source, io.StringIO() as dest:
        # A chunk size of 2 forces more than one chunk for this input
        transform_csv(source, dest, chunk_lines=2)
        actual = dest.getvalue()
        assert actual.replace('\r\n', '\n') == '''c,b,a
3,2,1
6,5,4
9,8,7
12,11,10
'''


if __name__ == '__main__':
    test()
answered Dec 22, 2024 at 3:17
\$\endgroup\$
