I have a lot of compressed CSV files in a directory. I want to read all of them into a single dataframe. This is what I have done so far:
import os
import gzip
import pandas as pd

# col_names and path are defined earlier
df = pd.DataFrame(columns=col_names)
for filename in os.listdir(path):
    with gzip.open(path + "/" + filename, 'rb') as f:
        temp = pd.read_csv(f, names=col_names)
        df = df.append(temp)
I have noticed that the above code runs quite fast initially, but it gets slower and slower as it reads more files. How can I improve this?
1 Answer
Ultimate optimization
- Avoid calling pd.DataFrame.append within a loop, as it creates a copy of the accumulated dataframe on each iteration. Apply pandas.concat to concatenate the pandas objects all at once.
- No need to gzip.open the files, as pandas.read_csv already allows on-the-fly decompression of on-disk data: compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
- Avoid hardcoding file paths with path + "/" + filename. Instead, use the suitable os.path.join feature: os.path.join(dirpath, fname)
The final optimized approach:
import os
import pandas as pd
dirpath = 'path_to_gz_files' # your directory path
df = pd.concat([pd.read_csv(os.path.join(dirpath, fname))
                for fname in os.listdir(dirpath)], ignore_index=True)
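
If the directory might contain files other than the compressed CSVs, a minimal variation is to filter the listing with glob. This sketch assumes the files all end in .gz and, as in the question, have no header row, so it reuses the col_names list from the question:

import glob
import os

import pandas as pd

dirpath = 'path_to_gz_files'  # your directory path

# Read only the .gz files; read_csv infers gzip compression from the file suffix.
df = pd.concat(
    [pd.read_csv(fpath, names=col_names)  # col_names as defined in the question
     for fpath in sorted(glob.glob(os.path.join(dirpath, '*.gz')))],
    ignore_index=True,
)

sorted keeps the concatenation order deterministic, since glob (like os.listdir) returns entries in arbitrary order.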
Comment: .tar.gz or .gz?
Reply: .gz