I currently have 600 CSV files (and this number will grow) of 50K lines each that I would like to put into one single dataframe. I did this; it works well and it takes 3 minutes:
```python
import glob
import os
import pandas as pd

colNames = ['COLUMN_A', 'COLUMN_B', ..., 'COLUMN_Z']
folder = 'PATH_TO_FOLDER'
# Dictionary of the type of each CSV column that is not a string
dictTypes = {'COLUMN_B': bool, 'COLUMN_D': int, ..., 'COLUMN_Y': float}
try:
    # Any column name not in the type dict is a string: add it to the dict
    dictTypes.update({col: str for col in colNames if col not in dictTypes})
except:
    print('Problem with the column names.')

# Function parsing dates from string to datetime, passed to the read_csv method
cache = {}
def cached_date_parser(s):
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y-%m-%d', errors="coerce")
    cache[s] = dt
    return dt

# Concatenate each file's dataframe into finalData
allFiles = glob.glob(os.path.join(folder, "*.csv"))
finalData = pd.concat([pd.read_csv(file, index_col=False, dtype=dictTypes,
                                   parse_dates=[6, 14],
                                   date_parser=cached_date_parser)
                       for file in allFiles],
                      ignore_index=True)
```
It takes one minute less without the date parsing. So I was wondering if I could improve the speed, or whether this is a standard amount of time given the number of files. Thanks!
2 Answers
Here is my untested feedback on your code. Some remarks:

- Encapsulate the functionality as a named function. I assumed `folder_path` as the main "variant" your calling code might want to vary, but your use case might "call" for a different first argument.
- Use PEP 8 recommendations for variable names.
- Comb/separate the different concerns within the function:
  - gather input files
  - handle column types
  - read CSVs and parse dates
- Depending on how much each of those concerns grows in size over time, multiple separate functions could organically grow out of these separate paragraphs, ultimately leading to a whole utility package or class (depending on how much "instance" configuration you would need to preserve, moving the `column_names` and `dtypes` parameters to object attributes of a `class XyzCsvReader`'s `__init__` method; see the sketch at the end of this answer).
- Concerning the date parsing: the bottleneck is probably not caused by caching or the lack of it, but by how often you invoke the heavy machinery behind `pd.to_datetime`. My guess is that calling it only once at the end, but with `infer_datetime_format` enabled, will be much faster than calling it once per row (even with your manual cache).
```python
import glob
import os

import pandas as pd


def read_xyz_csv_folder(
        folder_path,
        column_names=None,
        dtypes=None):
    all_files = glob.glob(os.path.join(folder_path, "*.csv"))
    if column_names is None:
        column_names = [
            'COLUMN_A',
            'COLUMN_B',  # ...
            'COLUMN_Z']
    if dtypes is None:
        dtypes = {
            'COLUMN_B': bool,
            'COLUMN_D': int,
            'COLUMN_Y': float}
    dtypes.update({col: str for col in column_names
                   if col not in dtypes})
    result = pd.concat((
        pd.read_csv(file, index_col=False, dtype=dtypes)
        for file in all_files),
        ignore_index=True)
    # untested, but the idea: call to_datetime once per date column
    # (on the whole Series) instead of once per row
    for idx in (6, 14):
        col = result.columns[idx]
        result[col] = pd.to_datetime(result[col],
                                     infer_datetime_format=True,
                                     errors='coerce')
    return result


# use as
read_xyz_csv_folder('PATH_TO_FOLDER')
```
Edit: as suggested by user FMc in their comment, switch from a list comprehension to a generator expression within `pd.concat` so as not to create an unneeded list.
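To make the class idea from the list above concrete, here is a minimal, untested sketch; `XyzCsvReader` and its method name are placeholders I made up for illustration, not a prescription:

```python
import glob
import os

import pandas as pd


class XyzCsvReader:
    """Read a folder of same-schema CSV files into one DataFrame."""

    def __init__(self, column_names, dtypes):
        # Keep the "instance" configuration here once, instead of
        # passing it to every call.
        self.column_names = column_names
        self.dtypes = dict(dtypes)
        self.dtypes.update({col: str for col in column_names
                            if col not in self.dtypes})

    def read_folder(self, folder_path):
        all_files = glob.glob(os.path.join(folder_path, "*.csv"))
        return pd.concat(
            (pd.read_csv(file, index_col=False, dtype=self.dtypes)
             for file in all_files),
            ignore_index=True)


# usage:
# reader = XyzCsvReader(['COLUMN_A', 'COLUMN_B'], {'COLUMN_B': bool})
# df = reader.read_folder('PATH_TO_FOLDER')
```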
Comment (TmSmth, Aug 27, 2020): thanks, same idea as GZ0's in the comments with `infer_datetime_format=True`, but it's better by a few seconds with your untested idea.
How often do you do this? Are all CSV files changed, or are there new files that should be aggregated? Perhaps you could reuse the dataframe and just add the new files. Otherwise you could create a dictionary and store it as JSON; then you would not have to read all files, and when reading the JSON into a `dict` `d`, you could just use `d.update(new_dict)`. A sketch of the incremental idea follows below.
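A minimal sketch of the "only add the new files" idea, assuming a pickle cache and a made-up `source_file` bookkeeping column (both are illustrative choices, not the only way to track what was already read):

```python
import glob
import os

import pandas as pd


def load_incrementally(folder, cache_path='cache.pkl'):
    # Load the previously built frame, if any, and note which
    # CSV files it already contains.
    if os.path.exists(cache_path):
        data = pd.read_pickle(cache_path)
        seen = set(data['source_file'])
    else:
        data = pd.DataFrame()
        seen = set()
    # Read only the files we have not seen before.
    frames = []
    for path in glob.glob(os.path.join(folder, '*.csv')):
        if path in seen:
            continue
        df = pd.read_csv(path, index_col=False)
        df['source_file'] = path  # remember each row's origin
        frames.append(df)
    if frames:
        new_data = pd.concat(frames, ignore_index=True)
        data = new_data if data.empty else pd.concat(
            [data, new_data], ignore_index=True)
        data.to_pickle(cache_path)
    return data
```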
Comments on the question:

- FMc: The `pd.concat()` function will take not only sequences (e.g., `list` or `tuple`) but any iterable, so you don't need to create a never-used list. Instead, just give `pd.concat()` a generator expression, a lightweight piece of code that `pd.concat()` will execute on your behalf to populate the data frame. Like this: `pd.concat((pd.read_csv(...) for file in allFiles), ...)`
- Where do `colNames` and `folder` come from?
- GZ0: Have you tried replacing `date_parser=cached_date_parser` with `infer_datetime_format=True` in the `read_csv` call? The API documentation says reading could be faster if the format is correctly inferred.
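For reference, a minimal sketch of what that replacement might look like in the asker's code (untested; `infer_datetime_format` was a valid `read_csv` argument in pandas versions of that era, and it only takes effect together with `parse_dates`):

```python
# Drop the per-row date_parser and let pandas infer the format once.
finalData = pd.concat(
    (pd.read_csv(file, index_col=False, dtype=dictTypes,
                 parse_dates=[6, 14],
                 infer_datetime_format=True)
     for file in allFiles),
    ignore_index=True)
```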