EDIT: This question is followed up by this question.
I'm in the process of filtering some very(!) large files (100 GB+): I can't download files with a lower granularity. This is a follow-up to this question.
The problem is as follows: I need to filter large files that look like the following (3b+ rows):

TIC, Date, Time, Bid, Offer
AAPL, 20090901, 09:45, 145, 145.5
AAPL, 20090902, 09:45, 145, 145.5
AAPL, 20090903, 09:45, 145, 145.5
I filter based on TICKER+DATE combinations found in an external file. I have, on average, ~1200 dates of interest per firm for ~700 firms. The large file contains all dates for the firms of interest, of which I want to extract only the few dates of interest. The big files are split by month (2013-01, 2013-02, etc.). The external file with the dates of interest looks like this:

AAPL, 20090902
AAPL, 20090903
A few changes were made since the previous post:
- I used the CSV module, as was suggested.
- I write the rows to be retained to disk after every 5 million rows.
- I iterate over the files using a try except statement.
I'm currently at 6 minutes of processing time for 30 million rows (1% of the file); I tested a few files and it works properly. However, with about 3 billion rows per file, that puts it at ~10 hours for one 120 GB file. Seeing as I have about twelve files, I'm very curious whether I can get significant performance improvements by doing things differently.
Any tips are greatly appreciated.
import os
import datetime
import csv
import re

ROOT_DIR = "H:/ROOT_DIR/"
SOURCE_FILES = os.path.join(ROOT_DIR, "10. Intradayfiles (source)/")
EXPORT_DIR = os.path.join(ROOT_DIR, "11. CSV Export (step 1 Extract relevant firmdates)/")
DATES_FILE = os.path.join(ROOT_DIR, "10. Dates of interest/firm_date_of_interest.csv")

# Build the original date dict
# For example:
# d['AAPL'] is a list with ['20140901', '20140902', '20140901']
with open(DATES_FILE, "r") as csvfile:
    d = {}
    reader = csv.reader(csvfile)
    reader.next()
    for line in reader:
        firm = line[1]
        date = line[2]
        if firm in d.keys():
            d[firm].append(date)
        else:
            d[firm] = [date]


def main():
    for root, dir, files in os.walk(SOURCE_FILES):
        num_files = len(files)
        for i, file in enumerate(files):
            print('File ' + str(i+1) + '/' + str(num_files) + '; ' + file)
            basename = os.path.splitext(file)[0]
            filepath = os.path.join(root, file)
            # Annotate files with 'DONE' after successful processing: skip those
            if re.search("DONE", basename):
                continue
            start = datetime.datetime.now()
            rows_to_keep = []
            # Read the file, append only rows for which the date occurs in the dictionary for that firm.
            with open(filepath, 'rb') as csvfile:
                startfile = datetime.datetime.now()
                reader = csv.reader(csvfile)
                saved = 0
                for i, row in enumerate(reader):
                    # Every 5 million rows, I save what we've extracted so far.
                    if i % 5000000 == 0:
                        if rows_to_keep:
                            with open(os.path.join(EXPORT_DIR, basename + ' EXTRACT' + str(saved) + '.csv'), 'wb') as csvfile:
                                writer = csv.writer(csvfile, quoting=csv.QUOTE_NONNUMERIC)
                                for k, line in enumerate(rows_to_keep):
                                    writer.writerow(line)
                            saved += 1
                            rows_to_keep = []
                        file_elapsed = datetime.datetime.now() - startfile
                        print("Took me " + str(file_elapsed.seconds) + ' seconds... for ' + str(i) + ' rows..')
                    # See if row[1] (the date) is in the dict, based on row[0] (the ticker)
                    try:
                        if row[1] in d[row[0]]:
                            rows_to_keep.append(row)
                    except KeyError:
                        continue
                    except IndexError:
                        continue
            os.rename(os.path.join(root, file),
                      os.path.join(root, os.path.splitext(file)[0] + '- DONE.csv'))
            elapsed = datetime.datetime.now() - start
            print("Took me " + str(elapsed.seconds) + ' seconds...')
    return rows_to_keep


if __name__ == "__main__":
    main()
- 1. How often do you need to carry out this kind of query? 2. How big are the results? (Gareth Rees, May 4, 2015 at 13:05)
- This is a one-time thing (for a study), but I'm interested in improving performance nonetheless, as it likely will not be the last time I'll deal with big files. The output of one file is in the range of ~20 GB, still with a high granularity (price per second). After reducing the granularity of these files to 20-minute intervals with R, only 50 MB remains! (MattV, May 4, 2015 at 13:10)
- Are you sure this code works? I see a number of "basic" problems I would not expect to see in a working program. (rolfl, May 4, 2015 at 14:00)
- @MVersteeg - so with those changes, now that you are actually writing data to disk, and clearing out memory each time, is it faster? (rolfl, May 4, 2015 at 14:05)
- You are correct, in prettying up the code for Code Review I introduced quite a few naming errors. Am editing the post as we speak, pardon me! (MattV, May 4, 2015 at 14:05)
2 Answers
You can make the reading of the file a generator function:
def getLines(filename, d):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            try:
                if row[1] in d[row[0]]:
                    yield row
            except KeyError:
                continue
            except IndexError:
                continue
This extracts the actual filter to a different function.
I would also suggest not accumulating the list of values to write out, but instead using the generator to create chunks directly.
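For example, something along these lines (a minimal sketch built on the getLines generator above; the write_chunks name, the chunk size and the output file naming are only illustrative, not part of the original script):

import csv
import itertools
import os

def write_chunks(filename, d, export_dir, basename, chunk_size=5000000):
    # Pull rows lazily from the generator and write them out in fixed-size
    # chunks, so at most chunk_size rows are ever held in memory.
    rows = getLines(filename, d)
    for chunk_no in itertools.count():
        chunk = list(itertools.islice(rows, chunk_size))
        if not chunk:
            break
        out_path = os.path.join(export_dir, basename + ' EXTRACT' + str(chunk_no) + '.csv')
        with open(out_path, 'wb') as outfile:
            writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
            writer.writerows(chunk)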
Note: this is a partial review.
Your script is missing a shebang. I would suggest #!/usr/bin/env pypy to both support virtualenv and suggest that this script will benefit from PyPy. It would also be nice to have a test case.
Here's what pylint complains about:
************* Module MVersteeg
C: 7, 0: Line too long (91/80) (line-too-long)
C: 8, 0: Line too long (86/80) (line-too-long)
C: 30, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 41, 0: Line too long (105/80) (line-too-long)
C: 50, 0: Line too long (124/80) (line-too-long)
C: 51, 0: Line too long (90/80) (line-too-long)
C: 57, 0: Line too long (111/80) (line-too-long)
C: 57, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 59, 0: Line too long (91/80) (line-too-long)
C: 67, 0: Line too long (107/80) (line-too-long)
C: 69, 0: Unnecessary parens after 'print' keyword (superfluous-parens)
C: 1, 0: Missing module docstring (missing-docstring)
W: 42,41: Redefining name 'csvfile' from outer scope (line 13) (redefined-outer-name)
W: 29,15: Redefining built-in 'file' (redefined-builtin)
W: 44,16: Redefining name 'reader' from outer scope (line 15) (redefined-outer-name)
W: 52,39: Redefining name 'line' from outer scope (line 17) (redefined-outer-name)
W: 27,14: Redefining built-in 'dir' (redefined-builtin)
C: 26, 0: Missing function docstring (missing-docstring)
R: 26, 0: Too many local variables (20/15) (too-many-locals)
W: 52,36: Unused variable 'k' (unused-variable)
W: 27,14: Unused variable 'dir' (unused-variable)
From my perspective, too much indentation suggests that you should factor out a few functions to make the code more readable. To make it more convenient, you could build a class that would store the inner state of your program.
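As a rough sketch only (the class and method names here are made up, not a drop-in replacement), the state that currently lives in module-level variables and deeply nested loops could be grouped like this:

import os

class IntradayExtractor(object):
    # Holds the shared state (date lookup, export directory) instead of globals.

    def __init__(self, dates_of_interest, export_dir):
        self.dates_of_interest = dates_of_interest
        self.export_dir = export_dir

    def process_file(self, filepath):
        # Filter one source file and write the matching rows out in chunks.
        pass  # body would be factored out of the current main()

    def run(self, source_dir):
        for root, dirs, files in os.walk(source_dir):
            for name in files:
                if 'DONE' in name:
                    continue
                self.process_file(os.path.join(root, name))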
print('File ' + str(i+1) + '/' + str(num_files) + '; ' + file)
This would be more efficient and readable if format strings were used.
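For example, the same line rewritten with str.format (just a sketch of the idea):

print('File {0}/{1}; {2}'.format(i + 1, num_files, file))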
I would also recommend using the pv program to monitor the file reading progress (in your current form, you're probably interested in running pv -d $( pgrep -f yourscriptname ) in a separate terminal).
Please post a testcase so the program can be profiled and you can find your bottleneck.
- Any tips on where I'd share such an immense data file? Would R code to generate a file suffice? Thanks for the tips! (MattV, May 5, 2015 at 12:01)
- Compress it, it should be smaller then. Actually, a few rows would already be enough for me to run the program and see its behavior. (d33tah, May 5, 2015 at 12:50)
- Uploading as we speak, I'll post it within 2 hours with my updated code (based on the answers to this question). I very much appreciate your time and tips. Also, I'm working on a Windows laptop: pv is not an option? (MattV, May 5, 2015 at 13:29)
- Being the Unix maniac I am, I would recommend using Cygwin, but I'm not sure if you'll feel comfortable using that. I know I would. (d33tah, May 5, 2015 at 13:34)
- Though, I'm not sure if pv -d would work there - it would be better to rework the program to take the data from standard input and then a normal pv usage (without -d) would do. Anyway, that's my approach, YMMV ;) (d33tah, May 5, 2015 at 13:35)