I wrote a script to reorder the columns in a CSV file in descending order and then write to another CSV file. My script needs to be able to handle several tens of millions of records, and I would like it to be as performant as possible.
This is basically a mockup of a more complex CSV transformation I would be working on for work (I don't yet know the nature of the transformation I would be performing). To clarify, I was directed to write this at work, and it would be tested/scrutinised to see if it's performant enough, but it's also not the final script we would eventually run.
CSV-Transform.py
import pandas as pd
import csv

chunksize = 10 ** 6  # Or whatever value the memory permits
source_file = ""  # Change to desired source file
destination_file = ""  # Change to desired destination file


def process(chunk, headers, dest):
    df = pd.DataFrame(chunk, columns=headers)
    df.to_csv(dest, header=False, index=False)


def transform_csv(source_file, destination_file):
    with open(source_file) as infile:
        reader = csv.DictReader(infile)
        new_headers = reader.fieldnames[::-1]
    with open(destination_file, "w+") as outfile:
        outfile.write(",".join(new_headers))
        outfile.write("\n")
    with open(destination_file, 'a') as outfile:
        for chunk in pd.read_csv(source_file, chunksize=chunksize):
            process(chunk, new_headers, outfile)


transform_csv(source_file, destination_file)
1 Answer
"reorder the columns in a CSV file in descending order"

Well... not exactly. You put the columns in reverse order. Descending order implies some sorting, which you aren't doing.
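To illustrate the distinction, a minimal sketch with a hypothetical DataFrame df (not from the question):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['b', 'c', 'a'])
reversed_cols = df.iloc[:, ::-1]                   # column order: a, c, b
descending = df[sorted(df.columns, reverse=True)]  # column order: c, b, a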
"I would like it to be as performant as possible."

This would probably call for parallelism, multiprocessing or otherwise; a rough sketch of one approach follows.
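A minimal sketch of chunk-level multiprocessing, assuming chunks can be transformed independently and that pickling DataFrames to worker processes is acceptable overhead. transform_csv_parallel, reverse_chunk, and the path arguments are illustrative names, not part of the original script:

import multiprocessing as mp

import pandas as pd


def reverse_chunk(chunk: pd.DataFrame) -> str:
    # Reverse the column order and serialise this chunk without a header.
    return chunk.iloc[:, ::-1].to_csv(index=False, header=False)


def transform_csv_parallel(source: str, dest: str, chunk_lines: int = 10**6) -> None:
    # On spawn-based platforms, call this under an if __name__ == '__main__' guard.
    with pd.read_csv(source, chunksize=chunk_lines) as chunks, \
            open(dest, 'w', newline='') as outfile, \
            mp.Pool() as pool:
        # The first chunk carries the (reversed) header row.
        outfile.write(next(chunks).iloc[:, ::-1].to_csv(index=False))
        # imap yields results in submission order, so row order is preserved.
        for text in pool.imap(reverse_chunk, chunks):
            outfile.write(text)

Whether this beats the single-process version depends on how much of the time is spent parsing versus writing; profiling on a representative file would settle it.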
- Write 1e6 instead of 10**6.
- source_file and destination_file probably shouldn't be globals.
- process has a confused job. If it's really to do the processing, it should be doing the [::-1]; as it is, it's not particularly processing, but writing the data to disc.
- transform_csv, to be more flexible, could accept file objects instead of filenames (the usage sketch after the suggested code shows this).
- Don't use the csv module in this context; stick with Pandas (a header-only read is sketched after this list).
- Don't open outfile twice; only open it once.
- Write unit tests.
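As a small illustration of dropping the csv module: if you ever need the header row on its own, pandas can read just that. A sketch, with 'input.csv' as a placeholder path:

import pandas as pd

# nrows=0 parses only the header row; no data rows are read.
new_headers = pd.read_csv('input.csv', nrows=0).columns[::-1]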
Suggested
import io
import typing

import pandas as pd


def process(chunk: pd.DataFrame) -> pd.DataFrame:
    return chunk.iloc[:, ::-1]


def transform_csv(
    source_file: typing.TextIO,
    dest_file: typing.TextIO,
    chunk_lines: int = 1e6,
) -> None:
    with pd.read_csv(source_file, chunksize=chunk_lines) as chunks:
        # Headers for first chunk only
        process(next(chunks)).to_csv(dest_file, index=False)
        for chunk in chunks:
            process(chunk).to_csv(dest_file, index=False, header=False)


def test() -> None:
    with io.StringIO('''a,b,c
1,2,3
4,5,6
7,8,9
10,11,12
''', newline='\n') as source, io.StringIO() as dest:
        transform_csv(source, dest, chunk_lines=2)
        actual = dest.getvalue()
        assert actual.replace('\r\n', '\n') == '''c,b,a
3,2,1
6,5,4
9,8,7
12,11,10
'''


if __name__ == '__main__':
    test()
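For completeness, a hypothetical driver showing transform_csv called with real file handles rather than StringIO; the paths are placeholders:

# Placeholder paths; newline='' lets pandas control the line endings.
with open('input.csv', newline='') as src, \
        open('output.csv', 'w', newline='') as dst:
    transform_csv(src, dst)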
"mockup of a more complex CSV transformation I would be working on" does not comply with the site's requirement of actual code from a project rather than ... hypothetical code.