Why is this program for extracting IDs from a file so slow?

Question 1

This is roughly what my data file looks like:

# Monid U B V R I u g r i J Jerr H Herr K Kerr IRAC1 I1err IRAC2 I2err IRAC3 I3err IRAC4 I4err MIPS24 M24err SpT HaEW mem comp
Mon-000001 99.999 99.999 21.427 99.999 18.844 99.999 99.999 99.999 99.999 16.144 99.999 15.809 0.137 16.249 99.999 15.274 0.033 15.286 0.038 99.999 99.999 99.999 99.999 99.999 99.999 null 55.000 1 N
Mon-000002 99.999 99.999 20.905 19.410 17.517 99.999 99.999 99.999 99.999 15.601 0.080 15.312 0.100 14.810 0.110 14.467 0.013 14.328 0.019 14.276 0.103 99.999 0.048 99.999 99.999 null -99.999 2 N

...and it's 31 MB in total. Here's my Python script that pulls the Mon-###### IDs (found at the beginning of each line).

import re

def pullIDs(file_input):
    '''Pulls Mon-IDs from input file.'''
    arrayID = []
    with open(file_input,'rU') as user_file:
        for line in user_file:
            arrayID.append(re.findall('Mon\-\d{6}',line))
    return arrayID

print pullIDs(raw_input("Enter your first file: "))

The script works, but for this particular file it ran for about 5 minutes before I eventually killed the process out of impatience. Is this just something I'll have to deal with in Python? That is, should this be written in a compiled language, considering the size of my data file?

Further info: This script is being run within Emacs. As the accepted answer explains, this is why it was running so slowly.

Question 2

I can't reproduce your problem. When I tried it, your program reads the ids from a 31 MiB file in less than a second. So I think there must be something you're not telling us.

Question 3

The only thing that might be of interest is that I'm running it through Emacs on a MacBook Pro.

Question 4

Me too, so that can't be it. Can you post a self-contained test case? For example, you could write some code that generates 31 MiB of test data, and then we could check our timing against yours on the same data.

Question 5

Hmm. I'd like to do that, but I wouldn't even know where to start. That's a little advanced for my level. I'd like to learn, though, if you wouldn't mind explaining how to go about it.

Question 6

@GarethRees A second thing I just discovered: if I run the script in my terminal, it finishes in about 2 seconds. It also completes in Eclipse Kepler in about 3 to 4 seconds. Hmm, Emacs?

Question 9

If you know that the Mon-###### IDs will always be at the start of each line, there is no need for .findall(..); just extract the first 10 characters of the line (provided the Mon-###### ID is never more than 10 characters long).
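A minimal sketch of the slicing idea, assuming every ID sits at the very start of its line and is exactly 10 characters long, as in the sample data. The name `pull_ids_by_slice` and the `startswith` guard (which skips the `# Monid ...` header line) are illustrative additions, not part of the original script. It is written for Python 3 and takes any iterable of lines, so an open file object works too.

```python
def pull_ids_by_slice(lines, prefix='Mon-', width=10):
    """Return the first `width` characters of every line starting with `prefix`."""
    return [line[:width] for line in lines if line.startswith(prefix)]


# Hypothetical sample lines shaped like the data file in the question.
sample = [
    '# Monid U B V\n',
    'Mon-000001 99.999 99.999 21.427\n',
    'Mon-000002 99.999 99.999 20.905\n',
]
print(pull_ids_by_slice(sample))
```

Unlike the original `pullIDs`, this returns a flat list of strings rather than a list of one-element lists.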

Question 10

If they aren't, though, would this be the only way? I only ask because I plan on using this on multiple files, some of which may not have the IDs at the start of the line. Thanks.

Question 11

Another option would be to change .findall(..) so that it only finds the first instance (if there is some kind of .find() or .findfirst()). The difference is that instead of traversing the entire line, it only traverses it until a match is found. It's a small improvement, but I guess it adds up.
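For what it's worth, the `re` module does have a function along these lines: `re.search` (or the `search` method of a compiled pattern) returns the first match found anywhere in the string. A sketch, assuming at most one ID per line is wanted; `pull_first_ids` is a hypothetical name, and the code is written for Python 3:

```python
import re

MON_ID = re.compile(r'Mon-\d{6}')  # compile once, reuse for every line


def pull_first_ids(lines):
    """Return the first Mon-###### ID on each line, skipping lines without one."""
    ids = []
    for line in lines:
        match = MON_ID.search(line)  # returns the first match, or None
        if match:
            ids.append(match.group())
    return ids


# Hypothetical sample: the ID need not be at the start of the line.
sample = [
    '# Monid U B V\n',
    'Mon-000001 99.999 99.999\n',
    '99.999 Mon-000002 99.999 Mon-000003\n',
]
print(pull_first_ids(sample))
```

Because the pattern is searched for rather than sliced out, this also covers files where the ID is not in the first column.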

Question 12

@Max: Did you try any of these suggestions? How much improvement to the runtime did they make?

Question 13

No, I didn't try it, and I don't expect any major improvement, but when you're after performance boosts I guess it all adds up.
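One way to settle the question is to measure. The sketch below times three approaches on synthetic in-memory lines with the `timeit` module. The line layout, the 20,000-line size, and the function names are assumptions made for illustration, and the numbers will vary by machine, so none are quoted here. It is written for Python 3.

```python
import re
from timeit import timeit

PATTERN = re.compile(r'Mon-\d{6}')

# Synthetic lines shaped like the sample data (an assumption about the real file).
lines = ['Mon-{0:06d}{1}\n'.format(i + 1, ' 99.999' * 20) for i in range(20000)]


def with_findall():
    # Scans every line to the end; yields a list of one-element lists.
    return [PATTERN.findall(line) for line in lines]


def with_search():
    # Takes only the first match on each line.
    return [m.group() for m in map(PATTERN.search, lines) if m]


def with_slice():
    # No regex at all; only valid if the ID is always the first 10 characters.
    return [line[:10] for line in lines]


for func in (with_findall, with_search, with_slice):
    print('{0:14s} {1:.4f} s'.format(func.__name__, timeit(func, number=5)))
```

Timing the parsing separately from any printing keeps terminal or editor overhead out of the measurement.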

Gareth Rees · Accepted Answer · 2013-11-29 12:08:31Z

You said in comments that you don't know how to create a self-contained test case. But that's really easy! All that's needed is a function like this:

def test_case(filename, n):
    """Write n lines of test data to filename."""
    with open(filename, 'w') as f:
        for i in range(n):
            # Two-space separators give 172 bytes per line.
            f.write('Mon-{0:06d} {1}\n'.format(i + 1, '  99.999' * 20))

You can use this to make a test case of about the right size:

>>> test_case('cr36275.data', 200000)
>>> import os
>>> os.stat('cr36275.data').st_size
34400000

That's about 34 MB (roughly 33 MiB), so close enough. Now we can see how fast your code really is, using the timeit module:

>>> from timeit import timeit
>>> timeit(lambda:pullIDs('cr36275.data'), number=1)
1.3354740142822266

Just over a second. There's nothing wrong with your code or the speed of Python.

So why does it take you many minutes? Well, you say that you're running it inside Emacs. That means that when you run

>>> pullIDs('cr36275.data')

Python prints out a list of 200,000 ids, and Emacs reads this line of output into the *Python* buffer and applies syntax highlighting rules to it as it goes. Emacs' syntax highlighting code is designed to work on lines of source code (at most a few hundred characters but mostly 80 characters or less), not on lines of output that are millions of characters long. This is what is taking all the time.

So don't do that. Read the list of ids into a variable and if you need to look at it, use slicing to look at bits of it:

>>> ids = pullIDs('cr36275.data')
>>> ids[:10]
[['Mon-000001'], ['Mon-000002'], ['Mon-000003'], ['Mon-000004'], ['Mon-000005'],
 ['Mon-000006'], ['Mon-000007'], ['Mon-000008'], ['Mon-000009'], ['Mon-000010']]
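To put a number on "millions of characters", here is a small sketch that measures the single line the interpreter would print for the full result. The `fake_ids` helper is hypothetical: it builds a list with the same shape as the `pullIDs` result without needing the data file. It is written for Python 3.

```python
def fake_ids(n):
    """Build a list shaped like the pullIDs result: one single-item list per line."""
    return [['Mon-{0:06d}'.format(i + 1)] for i in range(n)]


ids = fake_ids(200000)

# Length of the one line of output produced by echoing the whole list.
print(len(repr(ids)))  # 3200000

# A cheaper sanity check: the count plus both ends of the list.
print(len(ids), ids[:2], ids[-2:])
```

Each entry contributes 14 characters plus a two-character separator, so 200,000 ids come to about 3.2 million characters on one line.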

Wow. Thank you so much for this. It isn't crucial that I print these IDs to my screen, but I wanted to see if my script was working correctly, and this crazy slowdown kind of confused me. But you've cleared all of that up. Thanks for the test-case pointers and the insight into Emacs.
