I want this code to run faster.

import re
with open('largetextfile.txt') as f:
    for line in f:
        pattern = re.compile("^1234567")
        if pattern.match(line):
            print(line)

This takes 19 seconds.

I modified it:

import re
with open('largetextfile.txt') as f:
    for line in f:
        if "1234567" in line:
            pattern = re.compile("^1234567")
            if pattern.match(line):
                print(line)

This takes 7 seconds.

So the question is: is there a better way?

I got two ideas from the community and, based on them, asked a more detailed question at: https://codereview.stackexchange.com/questions/135159/python-search-for-array-in-large-text-file

asked Jul 16, 2016 at 12:00
  • A small thing to change is to take the pattern definition out of the loop. Commented Jul 16, 2016 at 12:04
  • Hmm, let me check. Commented Jul 16, 2016 at 12:05
  • That saved one second. Commented Jul 16, 2016 at 12:06
  • Maybe instead of checking if "1234567" in line:, check whether the first 7 characters of the line are equal to the string "1234567" (no in). Commented Jul 16, 2016 at 12:09
  • You are compiling the pattern on each iteration. Take it out of the loop. Commented Jul 16, 2016 at 13:08
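
The suggestion in these comments could look roughly like the sketch below. The helper name matching_lines and the in-memory sample list are made up for illustration; with the real file you would pass the open file object instead of the list. Note also that re.match only matches at the beginning of the string anyway, so the ^ is not strictly needed.

```python
import re

# Compiled once, outside the loop, as the comments suggest.
PATTERN = re.compile(r"^1234567")

def matching_lines(lines, pattern=PATTERN):
    """Yield every line that the pattern matches at its start."""
    for line in lines:
        if pattern.match(line):
            yield line

# Small in-memory sample standing in for the large file (hypothetical data).
sample = ["1234567 first\n", "x1234567 second\n", "1234567 third\n"]
for line in matching_lines(sample):
    print(line, end="")
```

With a file, the call would be matching_lines(f) inside the with open(...) block.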

3 Answers

Check if this matches your requirement:

with open('largetextfile.txt') as f:
    for line in f:
        if line.startswith('1234567'):
            print(line)
answered Jul 16, 2016 at 12:10

2 Comments

@scripting.filesystemobject can you check the time on this answer? I'm guessing it is much faster than yours.
It takes almost half the time of my code. I will validate it. Thanks.

Since you're matching a fixed string, you don't need regular expressions, so you can use this:

with open('bigfile.txt') as f:
    for line in f:
        if line[:7] == "1234567":
            print(line)

I noticed that using string slicing is slightly faster than startswith and found out this has been discussed here
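
If you want to check the slicing vs. startswith difference on your own machine, a rough timeit sketch could look like this. The function names, the sample line, and the repeat count are arbitrary choices for illustration, and the numbers will vary with the Python version and hardware, so treat it as a way to measure rather than as a result.

```python
import timeit

# Hypothetical sample line; any line starting with the prefix will do.
line = "1234567 some text that follows the prefix\n"

def by_slice(s):
    # Compare the first 7 characters with the prefix.
    return s[:7] == "1234567"

def by_startswith(s):
    # Let the str method do the prefix test.
    return s.startswith("1234567")

slice_time = timeit.timeit(lambda: by_slice(line), number=200_000)
startswith_time = timeit.timeit(lambda: by_startswith(line), number=200_000)
print(f"slice:      {slice_time:.3f} s")
print(f"startswith: {startswith_time:.3f} s")
```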

answered Jul 16, 2016 at 13:15

1 Comment

Thanks. It is faster.

To run tests, I copied the following text (6.31 MB, around 128,000 lines) into a file AAA.txt:
http://norvig.com/big.txt
Then, with the help of the random module, I turned it into a file BBB.txt by inserting '1234567' at the start of 1000 randomly chosen lines.

I tested several solutions on this modified text.

I can't tell which of the following is the fastest, but I think they are all faster than the other solutions on this page and than other solutions of mine.

They are based on the fact that the "in" test, 'string' in 'anotherstring', is tremendously fast.

def in_and_startswith(x):
    return '1234567' in x and x.startswith('1234567')
with open('BBB.txt') as f:
    for line in filter(in_and_startswith, f):
        x=0

.

def in_and_find(x):
    return '1234567' in x and x.find('1234567')==0
with open('BBB.txt') as f:
    for line in filter(in_and_find, f):
        x=0

.

def just_in(x):
    return '1234567' in x
with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.startswith('1234567'):
            x=0
with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.find('1234567')==0:
            x=0
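
To compare variants like these on identical data, a small harness along the following lines could be used. This is only a sketch: count_matches and only_startswith are names invented here, and the synthetic list of lines is a stand-in for BBB.txt, not the file I actually tested with, so the timings it prints are not the ones behind my claims above.

```python
import time

PREFIX = "1234567"

def in_and_startswith(x):
    return PREFIX in x and x.startswith(PREFIX)

def in_and_find(x):
    return PREFIX in x and x.find(PREFIX) == 0

def only_startswith(x):
    # Baseline without the "in" pre-check (added for comparison).
    return x.startswith(PREFIX)

def count_matches(predicate, lines):
    """Return (number of matching lines, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in filter(predicate, lines))
    elapsed = time.perf_counter() - start
    return count, elapsed

# Synthetic stand-in for BBB.txt: every 100th line starts with the prefix.
lines = [
    (PREFIX if i % 100 == 0 else "") + "some ordinary text\n"
    for i in range(100_000)
]

for predicate in (in_and_startswith, in_and_find, only_startswith):
    count, elapsed = count_matches(predicate, lines)
    print(f"{predicate.__name__}: {count} matches in {elapsed:.4f} s")
```

Counting the matches instead of printing them keeps print() out of the measurement, for the same reason I used x=0.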

Note that I tested with the instruction x=0, which has no particular meaning, instead of print(line), because print() takes a long time to execute. Repeating several print() calls takes much longer than printing a single string obtained by joining all the strings that must be printed.

Test the execution times of

for x in ['hkjh','kjhoi','3135487j','kjhskdkfh','54545779']:
    print(x)

and

print('\n'.join(x for x in ['hkjh','kjhoi','313587j','kjhskdkfh','54545779']))

You'll see the difference.
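
Applied to the original problem, the same idea could be sketched like this: collect the matching lines first and write them out in one call. The function name print_matches_at_once and its parameters are made up for this example, and the sample list stands in for the open file. It keeps all matches in memory, which I assume is acceptable when only a small share of the lines match.

```python
import sys

def print_matches_at_once(lines, prefix="1234567", out=None):
    """Write all lines starting with prefix in a single call; return the count."""
    if out is None:
        out = sys.stdout
    matches = [line for line in lines if line.startswith(prefix)]
    # Lines read from a file already end with a newline, so join with "".
    out.write("".join(matches))
    return len(matches)

# Small in-memory sample standing in for the large file (hypothetical data).
sample = ["1234567 a\n", "no match\n", "1234567 b\n"]
print_matches_at_once(sample)
```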

answered Jul 16, 2016 at 14:49

2 Comments

Thanks for huge help.
It will take some time for me to understand.
