I have 4 CSV files that I need to process (filter out the "Part" segment and then merge them), but the problem is that they do not fit in memory. So I decided to [open - filter - write out] each of these 4 files and merge them after reopening the filtered versions.
I learned that it is good practice to decouple functionality (filtering, merging, writing out), but in this case splitting the filtering from the writing out seems silly; it would just be wrapping an existing function from the pandas library. On the other hand, combining the two also makes me uncomfortable, since I have heard it is not good practice to write functions that both return a value and have a side effect (such as writing out a CSV), as follows:
import os
import pandas as pd


def exclude_segm_part(f):
    """Filter out the "Part client" segment from the credit selection and write it to disk.

    Args:
        f (string): filepath of the credit base (csv)
    """
    df = pd.read_csv(f, sep=';', encoding="latin")
    df_filter = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
    filename = "{base}.txt".format(base=os.path.basename(f).split(".txt")[0])
    df_filter.to_csv(filename, index=False, sep=';')
    return df_filter
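For completeness, the merge step I have in mind afterwards would be something along these lines (just a rough sketch; merge_filtered and the file names are placeholders):

import pandas as pd

def merge_filtered(filtered_files, out_file):
    """Reopen the filtered (now much smaller) files and concatenate them into one CSV."""
    frames = [pd.read_csv(f, sep=";") for f in filtered_files]
    pd.concat(frames).to_csv(out_file, index=False, sep=";")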
What would be your suggestion? (I hope my question is clear enough; I want to learn good coding practice in a data science environment.)
Comment (Sᴀᴍ Onᴇᴌᴀ ♦, Jan 28, 2021): Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it." Feel free to edit and give it a different title if there is something more appropriate.

Comment (RootTwo, Jan 29, 2021): Do all the csv files have the same format (i.e., the same columns in the same order)? It would help to provide a sample input and output.
1 Answer
First off, what you currently have is perfectly fine. The only things I would suggest are to use pathlib.Path instead of manually using format, and to consistently follow the PEP 8 naming scheme by using lower_case names and having spaces after commas in argument lists:
from pathlib import Path
file_name = Path(f).with_suffix(".txt")
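Put together, a minimal sketch of your function with those two tweaks applied (keeping the rest of the logic as-is) could look like this:

import pandas as pd
from pathlib import Path


def exclude_segm_part(f):
    """Filter out the "Part client" segment from the credit base and write it to disk."""
    df = pd.read_csv(f, sep=";", encoding="latin")
    df_filter = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
    # Same output name as before, built with pathlib instead of string formatting;
    # .name keeps only the file name, mirroring the original os.path.basename call.
    file_name = Path(f).with_suffix(".txt").name
    df_filter.to_csv(file_name, index=False, sep=";")
    return df_filter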
I see two ways you could take this code. One direction is making it more memory efficient by using normal Python to parse the file. This would allow you to make the code fully streaming and process all files in one go:
import csv


def filter_column(files, column, value):
    out_file_name = ...
    # newline="" is recommended when opening files for the csv module
    with open(out_file_name, "w", newline="") as out_file:
        # the csv module calls the separator `delimiter`, not `sep`
        writer = csv.writer(out_file, delimiter=";")
        for file_name in files:
            with open(file_name, newline="") as in_file:
                reader = csv.reader(in_file, delimiter=";")
                # the header row is consumed here to find the column index
                col = next(reader).index(column)
                writer.writerows(row for row in reader if row[col] != value)
This has almost no memory consumption, because writerows iterates over the generator expression instead of materializing it. If you want to make the row-by-row processing fully explicit (the only requirement then being that a single row fits into memory), replace the writerows call with:
for row in reader:
    if row[col] != value:
        writer.writerow(row)
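For your original task, a hypothetical call could then look like this (the four file names are placeholders, and out_file_name inside filter_column still has to be filled in):

files = ["credit1.csv", "credit2.csv", "credit3.csv", "credit4.csv"]
# Filters out the "PARTICULIERS" segment from all four files and merges the
# remaining rows into the single output file in one streaming pass.
filter_column(files, "LIB_SEGM_SR_ET", "PARTICULIERS")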
The other possibility is to go parallel and distributed and use something like dask:
import dask.dataframe as dd

files = ["file1.csv", "file2.csv"]
df = dd.read_csv(files, sep=";")
df_out = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
file_name = ...
# single_file=True lets dask write one merged CSV without first pulling the
# whole result into memory, which df_out.compute() would do
df_out.to_csv(file_name, index=False, sep=";", single_file=True)
This gives you the full ease of using pandas and splits the task into batches that fit into memory behind the scenes.
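If a single merged file is not strictly needed, a variant worth considering (a sketch, assuming the partitioned output layout is acceptable downstream) is to let dask write one CSV per partition, which avoids funnelling everything through a single write:

# The "*" in the name is replaced by the partition number, so this produces
# filtered-0.csv, filtered-1.csv, ... rather than one merged file.
df_out.to_csv("filtered-*.csv", index=False, sep=";")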