Python search for array in large text file

Question 1

I asked similar question at: https://stackoverflow.com/questions/38410982/superfast-regexmatch-in-large-text-file

from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
 for line in f:
 for x in paid:
 if line[:7]==x:
 print (line)
print (datetime.now() - startTime)
##=>0:11:05.542505

from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
 for line in f:
 for x in paid:
 if line.startswith(x):
 print (line)
print (datetime.now() - startTime)
##=>0:11:32.997729

Is there any faster way?

f.txt has ~ 14 million lines, and looks like this:

4287053 06218896 N 19801222 19810901 19881222 M171 
4287053 06218896 N 19801222 19810901 19850211 M170 
4289713 06222552 Y 19810105 19810915 19930330 SM02 
4289713 06222552 Y 19810105 19810915 19930303 M285 
4289713 06222552 Y 19810105 19810915 19921208 RMPN 
4289713 06222552 Y 19810105 19810915 19921208 ASPN 
4289713 06222552 Y 19810105 19810915 19881116 ASPN 
4289713 06222552 Y 19810105 19810915 19881107 M171 
4289713 06222552 Y 19810105 19810915 19850306 M170 
4291808 06215853 N 19801212 19810929 19851031 EXP. 
4291808 06215853 N 19801212 19810929 19850812 REM. 
4292069 06227825 N 19810123 19810929 19930926 EXP. 
4292069 06227825 N 19810123 19810929 19890323 ASPN 
4292069 06227825 N 19810123 19810929 19890320 M171 
4292069 06227825 N 19810123 19810929 19850314 M170 
4292142 06224175 N 19810112 19810929 19930926 EXP. 
4292142 06224175 N 19810112 19810929 19890316 M171 
4292142 06224175 N 19810112 19810929 19861008 ASPN 
4292142 06224175 N 19810112 19810929 19850925 M170 
4292142 06224175 N 19810112 19810929 19850925 M176 
4292142 06224175 N 19810112 19810929 19850812 REM. 
...
RE45962 14454334 Y 20140807 20160405 20160323 ASPN 
RE45972 14335639 N 20140718 20160412 20160512 M1551
RE45975 14464421 N 20140820 20160412 20160511 M1551
RE45975 14464421 N 20140820 20160412 20160510 ASPN 
RE46021 13775962 N 20130225 20160531 20160621 M1551
RE46028 14491699 N 20140919 20160614 20160621 STOL 
RE46046 13755710 N 20130131 20160628 20160624 ASPN 
RE46051 10137107 N 20020502 20160705 20160624 ASPN 
RE46074 14249009 N 20140617 20160719 20160614 ASPN

data.txt has ~ 200 lines, and looks like this:

Question 2

Performance

It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:

paid = list(set(t))

The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.

On the other hand, testing whether a particular element is a member of a set is O(1) — extremely efficient!

Miscellaneous

Please avoid cryptic filenames such as x and t. Calling the file handle f is OK, since f is a typical name for a file handle, but since you have two files in this program, more distinctive names would help readability. I would also avoid renaming the patent data file to f.txt, which is too cryptic.

It's a good habit to always call open() in the context of a with block, so that the file handle will be automatically closed for sure. The 'r' mode is implied, and can be omitted.

Performance

It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:

paid = list(set(t))

The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.

On the other hand, testing whether a particular element is a member of a set is O(1) — extremely efficient!

Miscellaneous

Please avoid cryptic filenames such as x and t. Calling the file handle f is OK, since f is a typical name for a file handle, but since you have two files in this program, more distinctive names would help readability. I would also avoid renaming the patent data file to f.txt, which is too cryptic.

It's a good habit to always call open() in the context of a with block, so that the file handle will be automatically closed for sure. The 'r' mode is implied, and can be omitted.

Stack Exchange Network

Python search for array in large text file

1 Answer 1

Performance

Miscellaneous

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python search for array in large text file

1 Answer 1

Performance

Miscellaneous

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions