I have 4 CSV files that I need to process (filter out the "Part" segment and then merge them), but the problem is that they do not fit in memory. So I decided to [open - filter - write out] each of these 4 files and merge them after reopening the filtered versions.
I learned that it is good practice to decouple functionality (filtering, merging, writing out), but in this case splitting the filtering from the writing out seems silly; it would just be wrapping an existing function from the pandas library. On the other hand, combining the two also makes me uncomfortable, since I have heard it is not good practice to write functions that both return a value and have a side effect (such as writing out a CSV), as follows:
import os
import pandas as pd


def exclude_segm_part(f):
    """Filter out the "Part client" segment from the credit selection and write it to disk.

    Args:
        f (string): filepath of the credit base (csv)
    """
    df = pd.read_csv(f, sep=';', encoding="latin")
    df_filter = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
    filename = "{base}.txt".format(base=os.path.basename(f).split(".txt")[0])
    df_filter.to_csv(filename, index=False, sep=';')
    return df_filter
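For completeness, the merge step I have in mind afterwards would be something along these lines (just a rough sketch; merge_filtered and the file names are placeholders):

import pandas as pd

def merge_filtered(filtered_files, out_file):
    """Reopen the filtered (now much smaller) files and concatenate them into one CSV."""
    frames = [pd.read_csv(f, sep=";") for f in filtered_files]
    pd.concat(frames).to_csv(out_file, index=False, sep=";")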
What would be your suggestion? (I hope my question is clear enough; I want to learn good coding practice in a data science environment.)
Comment (Sᴀᴍ Onᴇᴌᴀ ♦, Jan 28, 2021): Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it." Feel free to edit and give it a different title if there is something more appropriate.

Comment (RootTwo, Jan 29, 2021): Do all the csv files have the same format (i.e., the same columns in the same order)? It would help to provide a sample input and output.
1 Answer
First off, what you currently have is perfectly fine. The only things I would suggest are to use pathlib.Path instead of manually using format, and to consistently follow the PEP 8 naming scheme by using lower_case names and having spaces after commas in argument lists:
from pathlib import Path
file_name = Path(f).with_suffix(".txt")
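Put together, a minimal sketch of your function with those two tweaks applied (keeping the rest of the logic as-is) could look like this:

import pandas as pd
from pathlib import Path


def exclude_segm_part(f):
    """Filter out the "Part client" segment from the credit base and write it to disk."""
    df = pd.read_csv(f, sep=";", encoding="latin")
    df_filter = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
    # Same output name as before, built with pathlib instead of string formatting;
    # .name keeps only the file name, mirroring the original os.path.basename call.
    file_name = Path(f).with_suffix(".txt").name
    df_filter.to_csv(file_name, index=False, sep=";")
    return df_filter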
I see two ways you could take this code. One direction is making it more memory efficient by using normal Python to parse the file. This would allow you to make the code fully streaming and process all files in one go:
import csv


def filter_column(files, column, value):
    out_file_name = ...
    # newline="" is recommended when opening files for the csv module
    with open(out_file_name, "w", newline="") as out_file:
        # the csv module calls the separator `delimiter`, not `sep`
        writer = csv.writer(out_file, delimiter=";")
        for file_name in files:
            with open(file_name, newline="") as in_file:
                reader = csv.reader(in_file, delimiter=";")
                # the header row is consumed here to find the column index
                col = next(reader).index(column)
                writer.writerows(row for row in reader if row[col] != value)
This has almost no memory consumption, because writerows iterates over the generator expression instead of materializing it. If you want to make the row-by-row processing fully explicit (the only requirement then being that a single row fits into memory), replace the writerows call with:
for row in reader:
    if row[col] != value:
        writer.writerow(row)
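For your original task, a hypothetical call could then look like this (the four file names are placeholders, and out_file_name inside filter_column still has to be filled in):

files = ["credit1.csv", "credit2.csv", "credit3.csv", "credit4.csv"]
# Filters out the "PARTICULIERS" segment from all four files and merges the
# remaining rows into the single output file in one streaming pass.
filter_column(files, "LIB_SEGM_SR_ET", "PARTICULIERS")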
The other possibility is to go parallel and distributed and use something like dask:
import dask.dataframe as dd

files = ["file1.csv", "file2.csv"]
df = dd.read_csv(files, sep=";")
df_out = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
file_name = ...
# single_file=True lets dask write one merged CSV without first pulling the
# whole result into memory, which df_out.compute() would do
df_out.to_csv(file_name, index=False, sep=";", single_file=True)
This gives you the full ease of using pandas and splits the task into batches that fit into memory behind the scenes.
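If a single merged file is not strictly needed, a variant worth considering (a sketch, assuming the partitioned output layout is acceptable downstream) is to let dask write one CSV per partition, which avoids funnelling everything through a single write:

# The "*" in the name is replaced by the partition number, so this produces
# filtered-0.csv, filtered-1.csv, ... rather than one merged file.
df_out.to_csv("filtered-*.csv", index=False, sep=";")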