I have a process where I take several .lst flat files, apply filters, reformat the data, and append the results to a single file. I loop through every file in a subdirectory, apply the reformatting, and append the result to a file called Myoutput. I also strip out some rows of the dataframe and write the filtered result to a different .lst file called Myoutput2.
My code:
fileNames = next(os.walk('subdirectory'))[2]
for filename in fileNames:
    headers = ['My', 'list', 'of', 'headers']
    columns = [0, 2, 3, 4]
    filePath = r'subdirectory\\' + filename
    df = pd.read_csv(filePath, sep='\t', names=headers, usecols=columns)
    # My reformatting code...
    with open('output/Myoutput_{}_{}_{}.lst'.format(year, month, day), 'a') as f:
        df.to_csv(f, header=False, index=False, sep='\t')
    df = df.loc[df['Type'] != 'UNKNOWN']  # removes UNKNOWN types for paragon
    paragonCount += len(df)
    with open('output/Myoutput2_{}_{}_{}.lst'.format(year, month, day), 'a') as f:
        df.to_csv(f, header=False, index=False, sep='\t')
    os.remove('download/' + filename)
This process has worked so far, but I now have to change it so that the two output files are ordered by a datetime field in the files. All the datetime values in that field fall on the same day. The input files are not chunked by datetime, so it wouldn't be enough to just change the order in which I load the files.
I appended this code to the end:
df = pd.read_csv('output/Myoutput_{}_{}_{}.lst'.format(year, month, day), header=None, sep='\t')
df = df.sort_values(by=1)
df.to_csv('output/Myoutput_{}_{}_{}_ordered.lst'.format(year, month, day), index=False, header=False, sep='\t')
df = pd.read_csv('output/Myoutput2_{}_{}_{}.lst'.format(year, month, day), header=None, sep='\t')
df = df.sort_values(by=1)
df.to_csv('output/Myoutput2_{}_{}_{}_ordered.lst'.format(year, month, day), index=False, header=False, sep='\t')
The output files tend to be around 1.5GB, so this doubles the runtime of my script, and it feels inefficient to re-load the data into memory. Is there any way to speed this up, perhaps by having the loop insert the results according to their datetime field instead of appending them to the end of the file?
1 Answer
With the output DataFrames being about 1.5GB on disk, your performance may be slow because there is insufficient memory to perform these operations quickly.
One way to reduce the amount of memory used is to use the inplace keyword argument that is common to some of the pandas.DataFrame methods. Setting inplace=True, when available, makes the operation modify the DataFrame in place; with inplace=False a copy is created, which can consume more memory.
One quick suggestion, therefore, might be to replace df = df.sort_values(by=1) with df.sort_values(by=1, inplace=True).
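Put in context, a minimal sketch of what that change could look like in your post-processing step (assuming the same year, month and day variables and the same tab-separated layout as the rest of your script):

import pandas as pd

# Sketch only: year, month and day are assumed to be defined as in the
# original script, and the output is assumed to stay tab-separated.
for name in ('Myoutput', 'Myoutput2'):
    path = 'output/{}_{}_{}_{}.lst'.format(name, year, month, day)
    df = pd.read_csv(path, header=None, sep='\t')
    df.sort_values(by=1, inplace=True)  # column 1 is the datetime field, as in your code
    df.to_csv('output/{}_{}_{}_{}_ordered.lst'.format(name, year, month, day),
              header=False, index=False, sep='\t')

(Whether this actually lowers peak memory can depend on the pandas version, so it is worth measuring.)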
This answer on Stack Overflow may also be of interest (it discusses different optimisations for holding data in DataFrames, including inplace): https://stackoverflow.com/a/39377643/6852391
I tried this but it doesn't make much of a difference; the machine I'm working on has 16GB RAM. – Joshua Kidd, Sep 21, 2017 at 8:15
It isn't clear where year, month and day come from. Also, why are you reading files from 'subdirectory' but deleting them from 'download'? And why are you opening the output files in append mode rather than write mode?