I have a process where I take several .lst flat files, apply filters, reformat the data, and append the results to a single file. I loop through every file in a subdirectory, apply the reformatting, and append the result to a file called Myoutput. I also strip out some rows of the dataframe and write the filtered result to a different .lst file called Myoutput2.
My code:
fileNames = next(os.walk('subdirectory'))[2]
for filename in fileNames:
    headers = ['My', 'list', 'of', 'headers']
    columns = [0, 2, 3, 4]
    filePath = r'subdirectory\\' + filename
    df = pd.read_csv(filePath, sep='\t', names=headers, usecols=columns)
    # My reformatting code...
    with open('output/Myoutput_{}_{}_{}.lst'.format(year, month, day), 'a') as f:
        df.to_csv(f, header=False, index=False, sep='\t')
    df = df.loc[df['Type'] != 'UNKNOWN']  # removes UNKNOWN types for paragon
    paragonCount += len(df)
    with open('output/Myoutput2_{}_{}_{}.lst'.format(year, month, day), 'a') as f:
        df.to_csv(f, header=False, index=False, sep='\t')
    os.remove('download/' + filename)
This process has worked so far, but I now have to change it so that the two output files are ordered by a datetime field in the files. All the datetime values in that field fall on the same day. The input files are not chunked by datetime, so it wouldn't be enough to just change the order in which I load the files.
I appended this code to the end:
df = pd.read_csv('output/Myoutput_{}_{}_{}.lst'.format(year, month, day), header=None, sep='\t')
df = df.sort_values(by=1)
df.to_csv('output/Myoutput_{}_{}_{}_ordered.lst'.format(year, month, day), index=False, header=False, sep='\t')
df = pd.read_csv('output/Myoutput2_{}_{}_{}.lst'.format(year, month, day), header=None, sep='\t')
df = df.sort_values(by=1)
df.to_csv('output/Myoutput2_{}_{}_{}_ordered.lst'.format(year, month, day), index=False, header=False, sep='\t')
The output files tend to be around 1.5GB, so this doubles the runtime of my script, and it feels inefficient to re-load the data into memory. Is there any way to speed this up, perhaps by having the loop insert the results according to their datetime field instead of appending them to the end of the file?
1 Answer
With the output DataFrames being about 1.5GB on disk, your performance may be slow because there is insufficient memory to perform these operations quickly.
One way to reduce the amount of memory used is to use the inplace keyword argument that is common to some of the pandas.DataFrame methods. Setting inplace=True, when available, makes the operation modify the DataFrame in place; with inplace=False a copy is created, which can consume more memory.
One quick suggestion, therefore, might be to replace df = df.sort_values(by=1) with df.sort_values(by=1, inplace=True).
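Put in context, a minimal sketch of what that change could look like in your post-processing step (assuming the same year, month and day variables and the same tab-separated layout as the rest of your script):

import pandas as pd

# Sketch only: year, month and day are assumed to be defined as in the
# original script, and the output is assumed to stay tab-separated.
for name in ('Myoutput', 'Myoutput2'):
    path = 'output/{}_{}_{}_{}.lst'.format(name, year, month, day)
    df = pd.read_csv(path, header=None, sep='\t')
    df.sort_values(by=1, inplace=True)  # column 1 is the datetime field, as in your code
    df.to_csv('output/{}_{}_{}_{}_ordered.lst'.format(name, year, month, day),
              header=False, index=False, sep='\t')

(Whether this actually lowers peak memory can depend on the pandas version, so it is worth measuring.)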
This answer on Stack Overflow may also be of interest (it discusses different optimisations for holding data in DataFrames, including inplace): https://stackoverflow.com/a/39377643/6852391
I tried this but it doesn't make much of a difference; the machine I'm working on has 16GB RAM. – Joshua Kidd, Sep 21, 2017 at 8:15
It isn't clear where year, month and day come from. Also, why are you reading files from 'subdirectory' but deleting them from 'download'? And why are you opening the output files in append mode rather than write mode?