I'm trying to implement an algorithm in Python that searches for multiple keys across ten huge files (16 million rows each). I have a sorted file with 62 million keys, and I'm trying to scan each of the ten dataset files to look up every key and its value.
Sorted key file:
a
b
c
d
...
Each dataset file appears like this:
a 1
d 10
g 3
...
Output (if a key does not appear in a file, set its value to 0):
<sum_vals>,a,<val_1_file>,<val_2_file>,...,<val_10_file>
<sum_vals>,b,<val_1_file>,<val_2_file>,...,<val_10_file>
....
Here is my Python algorithm. I scan the file containing the keys and, for each key, I search every dataset file for a match and take its value. dataset is an array containing the file names; for each file I build an output line with 0 if the key is not present and <val_in_file> if it matches. Everything works fine on a small set of training data, but it takes an extremely long time on the real data.
import os,sys;

files = ["20140601","20140602","20140603","20140604","20140605","20140606","20140607","20140608","20140609","20140610"]

def printToFile(text):
    f = open('processed/20140601','a')
    f.write(text)
    f.close()

def getKey(line):
    data=line.split(" ")
    return data[0]+ " "+ data[1]

def getPageSource(line):
    data=line.split(" ")
    return data[0]

def getPageTitle(line):
    data=line.split(" ")
    return data[1]

def getPageClicks(line):
    data=line.split(" ")
    return data[2]

def searchWindow(dataset,key):
    isHere = 0;
    total = 0;
    acc = getPageTitle(key)
    for f in dataset:
        line_with_keyword = next((line for line in open("dataset/"+f) if key in line),None)
        if line_with_keyword is not None:
            isHere = 1
            click = getPageClicks(line_with_keyword)
        if(isHere == 1):
            acc = acc.strip('\n') + "," + click
            total += int(click)
            isHere = 0
        else:
            acc = acc.strip('\n') + "," + str(0)
            total += 0
            isHere = 0
    printToFile(str(total)+","+acc.strip('\n')+","+"\n")

with open("processed/sorted_keys") as inSortedKey:
    for line in inSortedKey:
        searchWindow(files,getKey(line).strip("\n"));
1 Answer
Some smaller adjustments
By just quickly glancing at the code I would suggest the following actions (a sketch pulling them together follows below):
- Move the opening of the output file to where you open the inSortedKey file, and keep it open until you have finished processing the files. Opening and closing files is rather expensive.
- Instead of splitting the line multiple times, do it once at the start of searchWindow() and use the different elements when you need them. Possibly skip the intermediate variables altogether.
- Simplify the logic around the isHere if statements: you do the same thing in both branches, except that in the second case click will always be 0. That eliminates the duplicated code here.
- Consider using another level of with open(...) and for another_line instead of the next((line for line ...)), and break out of that loop as early as possible to skip lines as soon as a match is found.
A little style advice to close off: remember that the Python style guide recommends snake_case for both variable and function names. :-)
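Pulling these points together, here is a rough, untested sketch of how the key loop could look; it keeps the dataset/ and processed/ paths and the files list from the question, and only restructures the code:

def search_window(dataset, key, out_file):
    title = key.split(" ")[1]
    clicks = []
    for name in dataset:
        click = 0
        with open("dataset/" + name) as data_file:
            for line in data_file:
                if key in line:
                    click = int(line.split(" ")[2])
                    break  # stop reading this file as soon as the key is found
        clicks.append(click)
    # total first, then the title, then one value per file, as in the original output
    out_file.write("{},{},{}\n".format(sum(clicks), title,
                                       ",".join(str(c) for c in clicks)))

# Open the key file and the output file once, not once per key
with open("processed/sorted_keys") as sorted_keys, \
        open("processed/20140601", "a") as out_file:
    for line in sorted_keys:
        search_window(files, " ".join(line.split(" ")[:2]).strip("\n"), out_file)

Note that this still scans every dataset file once per key, which is the real bottleneck addressed in the next section.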
Multiple reads of dataset
For each line in the inSortedKey file you read lines from every file in the dataset. That is a lot of extra file reading! Let's say you have \$n\$ lines in inSortedKey and \$m\$ files in the dataset, where each file has approximately \$p\$ lines. Then your code does \$O(n \cdot m \cdot p)\$ operations within searchWindow.
With your numbers: \$n = 62{,}000{,}000\,ドル \$m = 10\$ and \$p = 16{,}000{,}000\$.
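To make that concrete (rough arithmetic, counting one unit of work per line read): \$n \cdot m \cdot p = 62{,}000{,}000 \cdot 10 \cdot 16{,}000{,}000 \approx 10^{16}\$ line scans with the current approach, versus \$n + m \cdot p \approx 2.2 \cdot 10^{8}\$ for the single-pass approach described next.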
If you were able to pre-read inSortedKey and extract the keys into an in-memory table, and then do a single pass over each of the dataset files, matching every line against this precomputed key set, the cost would drop to \$O(n + m \cdot p)\$. That would really speed up your operations, but it does depend on the size of inSortedKey.
Even if you are not able to load all of inSortedKey, you would be much better off reading as large a chunk as you are comfortable with into memory before doing a full read of the dataset. Say you read 100,000 lines of inSortedKey into memory at a time; you would then be looking at \$O(\frac{n}{100000} \cdot m \cdot p)\$ operations, with a slightly heavier operation each time. It would still be a massive improvement.
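For illustration, a minimal sketch of the pre-read variant, assuming all keys fit in memory at once; the paths and the three-column line format come from the question, the rest is made up:

import collections

files = ["20140601", "20140602", "20140603", "20140604", "20140605",
         "20140606", "20140607", "20140608", "20140609", "20140610"]

# Pass 1: read the sorted keys once and reserve a zeroed slot per dataset file
values = collections.OrderedDict()
with open("processed/sorted_keys") as sorted_keys:
    for line in sorted_keys:
        key = " ".join(line.split(" ")[:2]).strip("\n")
        values[key] = [0] * len(files)

# Pass 2: one sequential read of each dataset file, with an O(1) lookup per line
for index, name in enumerate(files):
    with open("dataset/" + name) as data_file:
        for line in data_file:
            source, title, clicks = line.split(" ")[:3]
            key = source + " " + title
            if key in values:
                values[key][index] = int(clicks)

# Pass 3: write the results, still in sorted-key order
with open("processed/20140601", "w") as out_file:
    for key, clicks in values.items():
        out_file.write("{},{},{}\n".format(sum(clicks), key.split(" ")[1],
                                           ",".join(str(c) for c in clicks)))

Holding 62 million keys with ten counters each does need a fair amount of RAM, which is exactly where the chunked variant described above comes in.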
Consider switching to databases
If you're actually talking about cross-referencing 62 million keys against 160 million lines in the dataset files, you should seriously consider using a database.
Either permanently or temporarily. You would most likely get a performance gain if you imported the entire set into a database, did your operations there, and wrote the output, rather than doing all of this with plain file operations. Python has excellent support for sqlite (and pandas), in addition to multiple other modules and databases, which might help you.
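As a rough illustration of the temporary-database route, using the standard-library sqlite3 module; the table and column names here are invented, and the final query is just one example of what becomes cheap once the data is indexed:

import sqlite3

conn = sqlite3.connect("clicks.db")
conn.execute("CREATE TABLE IF NOT EXISTS clicks (day TEXT, key TEXT, clicks INTEGER)")

def load_file(name):
    # Bulk-insert one dataset file inside a single transaction
    with open("dataset/" + name) as data_file, conn:
        rows = ((name, " ".join(line.split(" ")[:2]), int(line.split(" ")[2]))
                for line in data_file)
        conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)

for name in files:  # files as in the question
    load_file(name)

# Index after the bulk load, then every key lookup is a cheap indexed query
conn.execute("CREATE INDEX IF NOT EXISTS idx_clicks_key ON clicks (key)")
for day, clicks in conn.execute("SELECT day, clicks FROM clicks WHERE key = ?",
                                ("some_source some_title",)):
    print(day, clicks)

Creating the index after loading, rather than before, usually keeps the import time down.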
See 200_success' answer for a somewhat related discussion on massive file sizes.
- \$\begingroup\$ Thank you for your help, I'm reviewing my code and I will edit my question with a revision. Here is additional information: n = 62,000,000, m = 10, p = 16,000,000. \$\endgroup\$ – JJack_ Commented Dec 8, 2015 at 3:41
- \$\begingroup\$ I reviewed my code according to what you suggested, but performance is still not good. I don't know exactly how to address the multiple reads of the dataset. Talking about databases: I tested one yesterday; it took about 55 minutes to load all the data into the db (8.3 GB), but it's quite slow to perform each query. \$\endgroup\$ – JJack_ Commented Dec 8, 2015 at 4:18
- \$\begingroup\$ You are not supposed to change the code after posting a question, so Jamal has rolled back your changes. By the way, how long does your code take for a full run? \$\endgroup\$ – holroy Commented Dec 8, 2015 at 4:22
- \$\begingroup\$ The follow-up might be closed, as you state that the result doesn't save correctly. As such the new code is not in a working state, and thus off-topic for Code Review. So you should reword it, or use the latest functioning version you've got! \$\endgroup\$ – holroy Commented Dec 8, 2015 at 23:08