I asked similar question at: https://stackoverflow.com/questions/38410982/superfast-regexmatch-in-large-text-file
from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
for line in f:
for x in paid:
if line[:7]==x:
print (line)
print (datetime.now() - startTime)
##=>0:11:05.542505
from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
for line in f:
for x in paid:
if line.startswith(x):
print (line)
print (datetime.now() - startTime)
##=>0:11:32.997729
Is there any faster way?
f.txt has ~ 14 million lines, and looks like this:
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
4289713 06222552 Y 19810105 19810915 19850306 M170
4291808 06215853 N 19801212 19810929 19851031 EXP.
4291808 06215853 N 19801212 19810929 19850812 REM.
4292069 06227825 N 19810123 19810929 19930926 EXP.
4292069 06227825 N 19810123 19810929 19890323 ASPN
4292069 06227825 N 19810123 19810929 19890320 M171
4292069 06227825 N 19810123 19810929 19850314 M170
4292142 06224175 N 19810112 19810929 19930926 EXP.
4292142 06224175 N 19810112 19810929 19890316 M171
4292142 06224175 N 19810112 19810929 19861008 ASPN
4292142 06224175 N 19810112 19810929 19850925 M170
4292142 06224175 N 19810112 19810929 19850925 M176
4292142 06224175 N 19810112 19810929 19850812 REM.
...
RE45962 14454334 Y 20140807 20160405 20160323 ASPN
RE45972 14335639 N 20140718 20160412 20160512 M1551
RE45975 14464421 N 20140820 20160412 20160511 M1551
RE45975 14464421 N 20140820 20160412 20160510 ASPN
RE46021 13775962 N 20130225 20160531 20160621 M1551
RE46028 14491699 N 20140919 20160614 20160621 STOL
RE46046 13755710 N 20130131 20160628 20160624 ASPN
RE46051 10137107 N 20020502 20160705 20160624 ASPN
RE46074 14249009 N 20140617 20160719 20160614 ASPN
data.txt has ~ 200 lines, and looks like this:
6268343
6268343
6268343
6268343
6268343
6268343
7749955
8710181
6268343
6384016
6458924
...
1 Answer 1
Performance
It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:
paid = list(set(t))
The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.
On the other hand, testing whether a particular element is a member of a set
is O(1) — extremely efficient!
Miscellaneous
Please avoid cryptic filenames such as x
and t
. Calling the file handle f
is OK, since f
is a typical name for a file handle, but since you have two files in this program, more distinctive names would help readability. I would also avoid renaming the patent data file to f.txt
, which is too cryptic.
It's a good habit to always call open()
in the context of a with
block, so that the file handle will be automatically closed for sure. The 'r'
mode is implied, and can be omitted.
Suggested solution
with open('data.txt') as paid_file:
paid = set(line.rstrip() for line in paid_file)
with open('MaintFeeEvents_20160711.txt') as patent_events_file:
for line in patent_events_file:
if line.split(' ', 1)[0] in paid:
print(line, end='')
-
2\$\begingroup\$ What? Are you kidding? 0:00:14.917000 compared to 0:11:05.542505? How? looks like same code to me. really amazed. \$\endgroup\$Rahul Patel– Rahul Patel2016年07月18日 08:53:15 +00:00Commented Jul 18, 2016 at 8:53
-
\$\begingroup\$ @Scripting.FileSystemObject The big difference is: for each line in f.txt, your code loops over all elements in paid; the solution code instead does a lookup in paid (which is a set there); that is much faster. \$\endgroup\$Roel Schroeven– Roel Schroeven2016年07月18日 11:59:12 +00:00Commented Jul 18, 2016 at 11:59
Explore related questions
See similar questions with these tags.