I'm trying to implement an algorithm in Python that searches for multiple keys across ten huge files (16 million rows each). I have a sorted file with 62 million keys, and I'm trying to scan each of the ten dataset files to look up every key and its value.
Sorted key file:
a
b
c
d
...
Each dataset file appears like this:
a 1
d 10
g 3
...
Output (if a key does not appear in a file, set its value to 0):
<sum_vals>,a,<val_1_file>,<val_2_file>,...,<val_10_file>
<sum_vals>,b,<val_1_file>,<val_2_file>,...,<val_10_file>
....
Here is my Python algorithm. I scan the file containing the keys and, for each key, I search every dataset file for a match and take its value. dataset is an array containing the file names; for each file I build an output line with 0 if the key is not present and <val_in_file> if it matches. Everything works fine on a small set of training data, but it takes an extremely long time on the real data.
import os,sys;

files = ["20140601","20140602","20140603","20140604","20140605","20140606","20140607","20140608","20140609","20140610"]

def printToFile(text):
    f = open('processed/20140601','a')
    f.write(text)
    f.close()

def getKey(line):
    data=line.split(" ")
    return data[0]+ " "+ data[1]

def getPageSource(line):
    data=line.split(" ")
    return data[0]

def getPageTitle(line):
    data=line.split(" ")
    return data[1]

def getPageClicks(line):
    data=line.split(" ")
    return data[2]

def searchWindow(dataset,key):
    isHere = 0;
    total = 0;
    acc = getPageTitle(key)
    for f in dataset:
        line_with_keyword = next((line for line in open("dataset/"+f) if key in line),None)
        if line_with_keyword is not None:
            isHere = 1
            click = getPageClicks(line_with_keyword)
        if(isHere == 1):
            acc = acc.strip('\n') + "," + click
            total += int(click)
            isHere = 0
        else:
            acc = acc.strip('\n') + "," + str(0)
            total += 0
            isHere = 0
    printToFile(str(total)+","+acc.strip('\n')+","+"\n")

with open("processed/sorted_keys") as inSortedKey:
    for line in inSortedKey:
        searchWindow(files,getKey(line).strip("\n"));
1 Answer
Some smaller adjustments
By just quickly glancing at the code I would suggest the following actions (a sketch pulling them together follows below):
- Move the opening of the output file to where you open the inSortedKey file, and keep it open until you have finished processing the files. Opening and closing files is rather expensive.
- Instead of splitting the line multiple times, do it once at the start of searchWindow() and use the different elements when you need them. Possibly skip the intermediate variables altogether.
- Simplify the logic around the isHere if statements: you do the same thing in both branches, except that in the second case click will always be 0. That eliminates the duplicated code here.
- Consider using another level of with open(...) and for another_line instead of the next((line for line ...)), and break out of that loop as early as possible to skip lines as soon as a match is found.
A little style advice to close off: remember that the Python style guide recommends snake_case for both variable and function names. :-)
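Pulling these points together, here is a rough, untested sketch of how the key loop could look; it keeps the dataset/ and processed/ paths and the files list from the question, and only restructures the code:

def search_window(dataset, key, out_file):
    title = key.split(" ")[1]
    clicks = []
    for name in dataset:
        click = 0
        with open("dataset/" + name) as data_file:
            for line in data_file:
                if key in line:
                    click = int(line.split(" ")[2])
                    break  # stop reading this file as soon as the key is found
        clicks.append(click)
    # total first, then the title, then one value per file, as in the original output
    out_file.write("{},{},{}\n".format(sum(clicks), title,
                                       ",".join(str(c) for c in clicks)))

# Open the key file and the output file once, not once per key
with open("processed/sorted_keys") as sorted_keys, \
        open("processed/20140601", "a") as out_file:
    for line in sorted_keys:
        search_window(files, " ".join(line.split(" ")[:2]).strip("\n"), out_file)

Note that this still scans every dataset file once per key, which is the real bottleneck addressed in the next section.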
Multiple reads of dataset
For each line in the inSortedKey file you read lines from every file in the dataset. That is a lot of extra file reading! Let's say you have \$n\$ lines in inSortedKey and \$m\$ files in the dataset, where each file has approximately \$p\$ lines. Then your code does \$O(n \cdot m \cdot p)\$ operations within searchWindow.
With your numbers: \$n = 62{,}000{,}000\,ドル \$m = 10\$ and \$p = 16{,}000{,}000\$.
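To make that concrete (rough arithmetic, counting one unit of work per line read): \$n \cdot m \cdot p = 62{,}000{,}000 \cdot 10 \cdot 16{,}000{,}000 \approx 10^{16}\$ line scans with the current approach, versus \$n + m \cdot p \approx 2.2 \cdot 10^{8}\$ for the single-pass approach described next.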
If you were able to pre-read inSortedKey and extract the keys into an in-memory table, and then do a single pass over each of the dataset files, matching every line against this precomputed key set, the cost would drop to \$O(n + m \cdot p)\$. That would really speed up your operations, but it does depend on the size of inSortedKey.
Even if you are not able to load all of inSortedKey, you would be much better off reading as large a chunk as you are comfortable with into memory before doing a full read of the dataset. Say you read 100,000 lines of inSortedKey into memory at a time; you would then be looking at \$O(\frac{n}{100000} \cdot m \cdot p)\$ operations, with a slightly heavier operation each time. It would still be a massive improvement.
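For illustration, a minimal sketch of the pre-read variant, assuming all keys fit in memory at once; the paths and the three-column line format come from the question, the rest is made up:

import collections

files = ["20140601", "20140602", "20140603", "20140604", "20140605",
         "20140606", "20140607", "20140608", "20140609", "20140610"]

# Pass 1: read the sorted keys once and reserve a zeroed slot per dataset file
values = collections.OrderedDict()
with open("processed/sorted_keys") as sorted_keys:
    for line in sorted_keys:
        key = " ".join(line.split(" ")[:2]).strip("\n")
        values[key] = [0] * len(files)

# Pass 2: one sequential read of each dataset file, with an O(1) lookup per line
for index, name in enumerate(files):
    with open("dataset/" + name) as data_file:
        for line in data_file:
            source, title, clicks = line.split(" ")[:3]
            key = source + " " + title
            if key in values:
                values[key][index] = int(clicks)

# Pass 3: write the results, still in sorted-key order
with open("processed/20140601", "w") as out_file:
    for key, clicks in values.items():
        out_file.write("{},{},{}\n".format(sum(clicks), key.split(" ")[1],
                                           ",".join(str(c) for c in clicks)))

Holding 62 million keys with ten counters each does need a fair amount of RAM, which is exactly where the chunked variant described above comes in.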
Consider switching to databases
If you're actually talking about cross-referencing 62 million keys against 160 million lines in the dataset files, you should seriously consider using a database.
Either permanently or temporarily. You would most likely get a performance gain if you imported the entire set into a database, did your operations there, and wrote the output, rather than doing all of this with plain file operations. Python has excellent support for sqlite (and pandas), in addition to multiple other modules and databases, which might help you.
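As a rough illustration of the temporary-database route, using the standard-library sqlite3 module; the table and column names here are invented, and the final query is just one example of what becomes cheap once the data is indexed:

import sqlite3

conn = sqlite3.connect("clicks.db")
conn.execute("CREATE TABLE IF NOT EXISTS clicks (day TEXT, key TEXT, clicks INTEGER)")

def load_file(name):
    # Bulk-insert one dataset file inside a single transaction
    with open("dataset/" + name) as data_file, conn:
        rows = ((name, " ".join(line.split(" ")[:2]), int(line.split(" ")[2]))
                for line in data_file)
        conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)

for name in files:  # files as in the question
    load_file(name)

# Index after the bulk load, then every key lookup is a cheap indexed query
conn.execute("CREATE INDEX IF NOT EXISTS idx_clicks_key ON clicks (key)")
for day, clicks in conn.execute("SELECT day, clicks FROM clicks WHERE key = ?",
                                ("some_source some_title",)):
    print(day, clicks)

Creating the index after loading, rather than before, usually keeps the import time down.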
See 200_success' answer for a somewhat related discussion on massive file sizes.
- \$\begingroup\$ Thank you for your help, I'm reviewing my code and I will edit my question with a revision. Here is additional information: n = 62,000,000, m = 10, p = 16,000,000. \$\endgroup\$ – JJack_ Commented Dec 8, 2015 at 3:41
- \$\begingroup\$ I reviewed my code according to what you suggested, but performance is still not good. I don't know exactly how to address the multiple reads of the dataset. Talking about databases: I tested one yesterday; it took about 55 minutes to load all the data into the db (8.3 GB), but it's quite slow to perform each query. \$\endgroup\$ – JJack_ Commented Dec 8, 2015 at 4:18
- \$\begingroup\$ You are not supposed to change the code after posting a question, so Jamal has rolled back your changes. By the way, how long does your code take for a full run? \$\endgroup\$ – holroy Commented Dec 8, 2015 at 4:22
- \$\begingroup\$ The follow-up might be closed, as you state that the result doesn't save correctly. As such the new code is not in a working state, and thus off-topic for Code Review. So you should reword it, or use the latest functioning version you've got! \$\endgroup\$ – holroy Commented Dec 8, 2015 at 23:08