EDIT: This question is followed up by this question.
I'm in the process of filtering some very(!) large files (100 GB+): I can't download files with a lower granularity. This is a follow-up to this question.
The problem is as follows: I need to filter large files that look like the following (3b+ rows):

TIC, Date, Time, Bid, Offer
AAPL, 20090901, 09:45, 145, 145.5
AAPL, 20090902, 09:45, 145, 145.5
AAPL, 20090903, 09:45, 145, 145.5
I filter based on TICKER+DATE combinations found in an external file. I have, on average, ~1200 dates of interest per firm for ~700 firms. The large file contains all dates for the firms of interest, of which I want to extract only the few dates of interest. The big files are split by month (2013-01, 2013-02, etc.). The external file with the dates of interest looks like this:

AAPL, 20090902
AAPL, 20090903
A few changes were made since the previous post:
- I used the CSV module, as was suggested.
- I write the rows to be retained to disk after every 5 million rows.
- I iterate over the files using a try except statement.
I'm currently at 6 minutes of processing time for 30 million rows (1% of the file); I tested a few files and it works properly. However, with about 3 billion rows per file, that puts it at ~10 hours for one 120 GB file. Seeing as I have about twelve files, I'm very curious whether I can get significant performance improvements by doing things differently.
Any tips are greatly appreciated.
import os
import datetime
import csv
import re

ROOT_DIR = "H:/ROOT_DIR/"
SOURCE_FILES = os.path.join(ROOT_DIR, "10. Intradayfiles (source)/")
EXPORT_DIR = os.path.join(ROOT_DIR, "11. CSV Export (step 1 Extract relevant firmdates)/")
DATES_FILE = os.path.join(ROOT_DIR, "10. Dates of interest/firm_date_of_interest.csv")

# Build the original date dict
# For example:
# d['AAPL'] is a list with ['20140901', '20140902', '20140901']
with open(DATES_FILE, "r") as csvfile:
    d = {}
    reader = csv.reader(csvfile)
    reader.next()
    for line in reader:
        firm = line[1]
        date = line[2]
        if firm in d.keys():
            d[firm].append(date)
        else:
            d[firm] = [date]


def main():
    for root, dir, files in os.walk(SOURCE_FILES):
        num_files = len(files)
        for i, file in enumerate(files):
            print('File ' + str(i+1) + '/' + str(num_files) + '; ' + file)
            basename = os.path.splitext(file)[0]
            filepath = os.path.join(root, file)
            # Annotate files with 'DONE' after successful processing: skip those
            if re.search("DONE", basename):
                continue
            start = datetime.datetime.now()
            rows_to_keep = []
            # Read the file, append only rows for which the date occurs in the dictionary for that firm.
            with open(filepath, 'rb') as csvfile:
                startfile = datetime.datetime.now()
                reader = csv.reader(csvfile)
                saved = 0
                for i, row in enumerate(reader):
                    # Every 5 million rows, I save what we've extracted so far.
                    if i % 5000000 == 0:
                        if rows_to_keep:
                            with open(os.path.join(EXPORT_DIR, basename + ' EXTRACT' + str(saved) + '.csv'), 'wb') as csvfile:
                                writer = csv.writer(csvfile, quoting=csv.QUOTE_NONNUMERIC)
                                for k, line in enumerate(rows_to_keep):
                                    writer.writerow(line)
                            saved += 1
                            rows_to_keep = []
                        file_elapsed = datetime.datetime.now() - startfile
                        print("Took me " + str(file_elapsed.seconds) + ' seconds... for ' + str(i) + ' rows..')
                    # See if row[1] (the date) is in the dict, based on row[0] (the ticker)
                    try:
                        if row[1] in d[row[0]]:
                            rows_to_keep.append(row)
                    except KeyError:
                        continue
                    except IndexError:
                        continue
            os.rename(os.path.join(root, file),
                      os.path.join(root, os.path.splitext(file)[0] + '- DONE.csv'))
            elapsed = datetime.datetime.now() - start
            print("Took me " + str(elapsed.seconds) + ' seconds...')
    return rows_to_keep


if __name__ == "__main__":
    main()
- 1. How often do you need to carry out this kind of query? 2. How big are the results? (Gareth Rees, May 4, 2015 at 13:05)
- This is a one-time thing (for a study), but I'm interested in improving performance nonetheless, as it likely will not be the last time I'll deal with big files. The output of one file is in the range of ~20 GB, still with a high granularity (price per second). After reducing the granularity of these files to 20-minute intervals with R, only 50 MB remains! (MattV, May 4, 2015 at 13:10)
- Are you sure this code works? I see a number of "basic" problems I would not expect to see in a working program. (rolfl, May 4, 2015 at 14:00)
- @MVersteeg - so with those changes, now that you are actually writing data to disk, and clearing out memory each time, is it faster? (rolfl, May 4, 2015 at 14:05)
- You are correct, in prettying up the code for Code Review I introduced quite a few naming errors. Am editing the post as we speak, pardon me! (MattV, May 4, 2015 at 14:05)
2 Answers
You can make the reading of the file a generator function:
def getLines(filename, d):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            try:
                if row[1] in d[row[0]]:
                    yield row
            except KeyError:
                continue
            except IndexError:
                continue
This extracts the actual filter to a different function.
I would also suggest not accumulating the list of values to write out, but instead using the generator to create chunks directly.
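For example, something along these lines (a minimal sketch built on the getLines generator above; the write_chunks name, the chunk size and the output file naming are only illustrative, not part of the original script):

import csv
import itertools
import os

def write_chunks(filename, d, export_dir, basename, chunk_size=5000000):
    # Pull rows lazily from the generator and write them out in fixed-size
    # chunks, so at most chunk_size rows are ever held in memory.
    rows = getLines(filename, d)
    for chunk_no in itertools.count():
        chunk = list(itertools.islice(rows, chunk_size))
        if not chunk:
            break
        out_path = os.path.join(export_dir, basename + ' EXTRACT' + str(chunk_no) + '.csv')
        with open(out_path, 'wb') as outfile:
            writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
            writer.writerows(chunk)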
Note: this is a partial review.
Your script is missing a shebang. I would suggest #!/usr/bin/env pypy to both support virtualenv and suggest that this script will benefit from PyPy. It would also be nice to have a test case.
Here's what pylint complains about:
************* Module MVersteeg
C: 7, 0: Line too long (91/80) (line-too-long)
C: 8, 0: Line too long (86/80) (line-too-long)
C: 30, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 41, 0: Line too long (105/80) (line-too-long)
C: 50, 0: Line too long (124/80) (line-too-long)
C: 51, 0: Line too long (90/80) (line-too-long)
C: 57, 0: Line too long (111/80) (line-too-long)
C: 57, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 59, 0: Line too long (91/80) (line-too-long)
C: 67, 0: Line too long (107/80) (line-too-long)
C: 69, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 1, 0: Missing module docstring (missing-docstring)
W: 42,41: Redefining name 'csvfile' from outer scope (line 13) (redefined-outer-name)
W: 29,15: Redefining built-in 'file' (redefined-builtin)
W: 44,16: Redefining name 'reader' from outer scope (line 15) (redefined-outer-name)
W: 52,39: Redefining name 'line' from outer scope (line 17) (redefined-outer-name)
W: 27,14: Redefining built-in 'dir' (redefined-builtin)
C: 26, 0: Missing function docstring (missing-docstring)
R: 26, 0: Too many local variables (20/15) (too-many-locals)
W: 52,36: Unused variable 'k' (unused-variable)
W: 27,14: Unused variable 'dir' (unused-variable)
From my perspective, too much indentation suggests that you should factor out a few functions to make the code more readable. To make it more convenient, you could build a class that would store the inner state of your program.
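As a rough sketch only (the class and method names here are made up, not a drop-in replacement), the state that currently lives in module-level variables and deeply nested loops could be grouped like this:

import os

class IntradayExtractor(object):
    # Holds the shared state (date lookup, export directory) instead of globals.

    def __init__(self, dates_of_interest, export_dir):
        self.dates_of_interest = dates_of_interest
        self.export_dir = export_dir

    def process_file(self, filepath):
        # Filter one source file and write the matching rows out in chunks.
        pass  # body would be factored out of the current main()

    def run(self, source_dir):
        for root, dirs, files in os.walk(source_dir):
            for name in files:
                if 'DONE' in name:
                    continue
                self.process_file(os.path.join(root, name))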
print('File ' + str(i+1) + '/' + str(num_files) + '; ' + file)
This would be more efficient and readable if format strings were used.
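For example, the same line rewritten with str.format (just a sketch of the idea):

print('File {0}/{1}; {2}'.format(i + 1, num_files, file))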
I would also recommend using the pv program to monitor the file reading progress (in your current form, you're probably interested in running pv -d $( pgrep -f yourscriptname ) in a separate terminal).
Please post a testcase so the program can be profiled and you can find your bottleneck.
- Any tips on where I'd share such an immense data file? Would R code to generate a file suffice? Thanks for the tips! (MattV, May 5, 2015 at 12:01)
- Compress it, it should be smaller then. Actually, a few rows would already be enough for me to run the program and see its behavior. (d33tah, May 5, 2015 at 12:50)
- Uploading as we speak, I'll post it within 2 hours with my updated code (based on the answers to this question). I very much appreciate your time and tips. Also, I'm working on a Windows laptop: pv is not an option? (MattV, May 5, 2015 at 13:29)
- Being the Unix maniac I am, I would recommend using Cygwin, but I'm not sure if you'll feel comfortable using that. I know I would. (d33tah, May 5, 2015 at 13:34)
- Though, I'm not sure if pv -d would work there - it would be better to rework the program to take the data from standard input and then a normal pv usage (without -d) would do. Anyway, that's my approach, YMMV ;) (d33tah, May 5, 2015 at 13:35)