superfast regexmatch in large text file

Question 1

I want this code to work fast.

import re
with open('largetextfile.txt') as f:
    for line in f:
        pattern = re.compile("^1234567")
        if pattern.match(line):
            print(line)

It takes 19 seconds.

I modified it:

import re
with open('largetextfile.txt') as f:
    for line in f:
        if "1234567" in line:
            pattern = re.compile("^1234567")
            if pattern.match(line):
                print(line)

This version takes 7 seconds.

So the question is, is there any better way?

I got two ideas from the community and, based on them, asked a more detailed question at: https://codereview.stackexchange.com/questions/135159/python-search-for-array-in-large-text-file

Question 2

A small thing to change is to take the pattern definition out of the loop.

Question 3

hmm. Let me check.

Question 4

Got a one-second benefit.

Question 5

Maybe instead of checking if "1234567" in line:, check whether the first 7 characters of the line are equal to "1234567" as a string (no in).

Question 6

You are compiling the pattern on each iteration. Take it out of the loop.
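A minimal sketch of that change, assuming the goal is still to find lines starting with 1234567. The names PATTERN and matching_lines are made up for illustration. Note that the re module caches recently compiled patterns, so calling re.compile inside the loop mostly costs a cache lookup per line rather than a full recompile; hoisting it still removes that per-line overhead.

```python
import re

# Compiled once, outside the loop (hypothetical module-level name).
PATTERN = re.compile(r"^1234567")


def matching_lines(lines, pattern=PATTERN):
    """Yield each line that the pattern matches at its start."""
    for line in lines:
        if pattern.match(line):
            yield line
```

With the file from the question, this would be used by opening 'largetextfile.txt' and looping over matching_lines(f), since a file object is an iterable of lines.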

Question 7

Check if this matches your requirement:

with open('largetextfile.txt') as f:
    for line in f:
        if line.startswith('1234567'):
            print(line)

Question 8

@scripting.filesystemobject can you check the timing of this answer? I'm guessing it is much faster than yours.

Question 9

It takes almost half the time of my code. I will validate it. Thanks.

Question 10

Since you're matching a fixed string, you don't need regular expressions, so you can use this:

with open('bigfile.txt') as f:
    for line in f:
        if line[:7]=="1234567":
            print(line)

I noticed that using string slicing is slightly faster than startswith and found out this has been discussed here
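One way to check the slicing versus startswith claim on your own machine is a small timeit comparison. This is only a sketch: the sample line, the function names and the iteration count are assumptions, and the result can differ between Python versions.

```python
import timeit

PREFIX = "1234567"
# Hypothetical sample line; real lines from the file may behave differently.
LINE = "1234567 some text that follows the prefix\n"


def by_startswith(line=LINE):
    return line.startswith(PREFIX)


def by_slice(line=LINE):
    return line[:7] == PREFIX


def compare(number=200000):
    """Return the total seconds spent in each variant over `number` calls."""
    return {
        "startswith": timeit.timeit(by_startswith, number=number),
        "slice": timeit.timeit(by_slice, number=number),
    }
```

Calling compare() returns a dict with one timing per variant. Both variants include the same function-call overhead, so only the relative difference is meaningful.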

Question 11

Thanks. It is faster.

Question 12

To run tests, I copied the following text (6.31 MB, around 128,000 lines) into a file AAA.txt:
http://norvig.com/big.txt
Then, with the help of the random module, I turned it into a file BBB.txt by inserting '1234567' at the start of 1000 randomly chosen lines.
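The exact script used to build BBB.txt is not shown, so the following is only a guess at how such a file could be produced. The function names, the seed parameter and the use of random.sample are assumptions.

```python
import random


def insert_prefix(lines, prefix="1234567", count=1000, seed=None):
    """Return a copy of lines with prefix prepended to `count` random lines."""
    rng = random.Random(seed)
    # Pick distinct line indices; never more than there are lines.
    chosen = set(rng.sample(range(len(lines)), min(count, len(lines))))
    return [prefix + line if i in chosen else line
            for i, line in enumerate(lines)]


def make_test_file(src="AAA.txt", dst="BBB.txt", **kwargs):
    """Read src, prefix some random lines, and write the result to dst."""
    with open(src) as f:
        lines = f.readlines()
    with open(dst, "w") as f:
        f.writelines(insert_prefix(lines, **kwargs))
```

Passing a fixed seed makes the modified file reproducible between test runs.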

I tested several solutions on this modified text.

I can't tell which of the following is the fastest, but I think they are all faster than the other solutions I have read on this page and than my own other attempts.

They rely on the fact that the "in" test, 'string' in 'anotherstring', is tremendously fast.

def in_and_startswith(x):
    return '1234567' in x and x.startswith('1234567')
with open('BBB.txt') as f:
    for line in filter(in_and_startswith, f):
        x=0

.

def in_and_find(x):
    return '1234567' in x and x.find('1234567')==0
with open('BBB.txt') as f:
    for line in filter(in_and_find, f):
        x=0

.

def just_in(x):
    return '1234567' in x
with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.startswith('1234567'):
            x=0
with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.find('1234567')==0:
            x=0
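To compare such predicates without file I/O getting in the way, one option is to read the lines into a list once and time each filter over that list. The harness below is a sketch, not the method used for the tests above. best_time and only_startswith are hypothetical names, and only_startswith is included as a baseline without the "in" pre-check.

```python
import time

NEEDLE = '1234567'


def in_and_startswith(x):
    return NEEDLE in x and x.startswith(NEEDLE)


def in_and_find(x):
    return NEEDLE in x and x.find(NEEDLE) == 0


def only_startswith(x):
    # Baseline: no "in" pre-check.
    return x.startswith(NEEDLE)


def best_time(predicate, lines, repeat=5):
    """Return (best elapsed seconds, number of matches) over several runs."""
    best = float('inf')
    count = 0
    for _ in range(repeat):
        start = time.perf_counter()
        count = sum(1 for _line in filter(predicate, lines))
        best = min(best, time.perf_counter() - start)
    return best, count
```

Taking the best of several runs reduces noise from other processes. The match count is returned so you can confirm that every predicate selects the same lines.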

Note that I tested with just the instruction x=0, which does nothing meaningful, instead of print(line), because print() takes a long time to execute. Repeating many print() calls takes much longer than printing a single string obtained by joining all the strings that must be printed.

Test the execution times of

for x in ['hkjh','kjhoi','3135487j','kjhskdkfh','54545779']:
    print(x)

and

print('\n'.join(x for x in ['hkjh','kjhoi','313587j','kjhskdkfh','54545779']))

You'll see the difference.
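Applied to the original problem, the same idea would be to collect the matching lines first and write them out in one call. This is a sketch under the assumption that the matches fit comfortably in memory. write_matches and its parameters are made-up names.

```python
import sys


def write_matches(lines, prefix="1234567", out=sys.stdout):
    """Write all lines starting with prefix in one call; return their count."""
    matches = [line for line in lines if line.startswith(prefix)]
    # Lines read from a file keep their trailing newline, so a plain
    # join is enough and no extra separator is added.
    out.write("".join(matches))
    return len(matches)
```

With a file, this would be called as write_matches(f) inside the with open(...) block. The out parameter exists so the output can be redirected, for example to an io.StringIO while testing.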

Question 13

Thanks for huge help.

Question 14

It will take some time for me to understand.

Accepted answer: the startswith solution under Question 7 above, posted by shiva on 2016-07-16 12:10:28Z.

Comments on it (their text appears under Question 8 and Question 9 above): joel goldstick, 2016-07-16T12:13:10.533Z; Rahul, 2016-07-16T12:14:39.877Z.
