\$\begingroup\$

This is roughly what my data file looks like:

# Monid U B V R I u g r i J Jerr H Herr K Kerr IRAC1 I1err IRAC2 I2err IRAC3 I3err IRAC4 I4err MIPS24 M24err SpT HaEW mem comp
Mon-000001 99.999 99.999 21.427 99.999 18.844 99.999 99.999 99.999 99.999 16.144 99.999 15.809 0.137 16.249 99.999 15.274 0.033 15.286 0.038 99.999 99.999 99.999 99.999 99.999 99.999 null 55.000 1 N
Mon-000002 99.999 99.999 20.905 19.410 17.517 99.999 99.999 99.999 99.999 15.601 0.080 15.312 0.100 14.810 0.110 14.467 0.013 14.328 0.019 14.276 0.103 99.999 0.048 99.999 99.999 null -99.999 2 N

...and it's a total of 31 MB in size. Here's my Python script that pulls the Mon-###### IDs (found at the beginning of each line).

import re

def pullIDs(file_input):
    '''Pulls Mon-IDs from input file.'''
    arrayID = []
    with open(file_input,'rU') as user_file:
        for line in user_file:
            arrayID.append(re.findall('Mon\-\d{6}',line))
    return arrayID

print pullIDs(raw_input("Enter your first file: "))

The script works, but for this particular file it ran for well over 5 minutes and I eventually killed the process out of impatience. Is this just something I'll have to deal with in Python? That is, should this be written in a compiled language, considering the size of my data file?

Further info: This script is being run within Emacs. As the accepted answer explains, this is why it was running so slowly.

Gareth Rees
asked Nov 28, 2013 at 7:51
\$\endgroup\$
  • \$\begingroup\$ I can't reproduce your problem. When I tried it, your program reads the ids from a 31 MiB file in less than a second. So I think there must be something you're not telling us. \$\endgroup\$ Commented Nov 28, 2013 at 12:24
  • \$\begingroup\$ The only thing that might be of interest is that I'm running it through Emacs on a macbook pro. \$\endgroup\$ Commented Nov 28, 2013 at 17:34
  • \$\begingroup\$ Me too, so that can't be it. Can you post a self-contained test case? For example, you could write some code that generates 31 MiB of test data, and then we could check our timing against yours on the same data. \$\endgroup\$ Commented Nov 28, 2013 at 17:52
  • \$\begingroup\$ Hmmmm. I'd like to do that, but I wouldn't even know where to start with it. That's a little bit advanced for my level. I'd like to learn, though, if you wouldn't mind explaining how to go about doing that... \$\endgroup\$ Commented Nov 29, 2013 at 1:59
  • \$\begingroup\$ @GarethRees A second thing I just discovered...if I run the script in my terminal it finishes in about 2 seconds. It also completes within Eclipse Kepler within about 3 to 4 seconds. Hmmmm, Emacs? \$\endgroup\$ Commented Nov 29, 2013 at 2:18

2 Answers

\$\begingroup\$

You said in comments that you don't know how to create a self-contained test case. But that's really easy! All that's needed is a function like this:

def test_case(filename, n):
    """Write n lines of test data to filename."""
    with open(filename, 'w') as f:
        for i in range(n):
            f.write('Mon-{0:06d} {1}\n'.format(i + 1, ' 99.999' * 20))

You can use this to make a test case of about the right size:

>>> test_case('cr36275.data', 200000)
>>> import os
>>> os.stat('cr36275.data').st_size
34400000

That's about 34 MB, so close enough. Now we can see how fast your code really is, using the timeit module:

>>> from timeit import timeit
>>> timeit(lambda:pullIDs('cr36275.data'), number=1)
1.3354740142822266

Just over a second. There's nothing wrong with your code or the speed of Python.

So why does it take you many minutes? Well, you say that you're running it inside Emacs. That means that when you run

>>> pullIDs('cr36275.data')

Python prints out a list of 200,000 ids, and Emacs reads this line of output into the *Python* buffer and applies syntax highlighting rules to it as it goes. Emacs' syntax highlighting code is designed to work on lines of source code (at most a few hundred characters but mostly 80 characters or less), not on lines of output that are millions of characters long. This is what is taking all the time.

So don't do that. Read the list of ids into a variable and if you need to look at it, use slicing to look at bits of it:

>>> ids = pullIDs('cr36275.data')
>>> ids[:10]
[['Mon-000001'], ['Mon-000002'], ['Mon-000003'], ['Mon-000004'], ['Mon-000005'],
 ['Mon-000006'], ['Mon-000007'], ['Mon-000008'], ['Mon-000009'], ['Mon-000010']]
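
As the output above shows, re.findall returns a list for every line, so the result is a list of one-element lists. Here is a minimal sketch of one way to flatten the result and look at a short summary rather than the whole list. It assumes Python 3 syntax, and the names pull_ids_flat and summarize are hypothetical helpers, not part of the original script:

```python
import re

# Same pattern as in the question, compiled once for reuse.
MON_ID = re.compile(r'Mon-\d{6}')


def pull_ids_flat(lines):
    """Return a flat list of Mon-IDs found in an iterable of lines."""
    ids = []
    for line in lines:
        # extend() adds the matches themselves instead of one list per line.
        ids.extend(MON_ID.findall(line))
    return ids


def summarize(ids, n=3):
    """Return a short one-line description of a (possibly huge) list of IDs."""
    return '{0} ids, first {1}: {2}'.format(len(ids), n, ids[:n])


# Two made-up sample lines standing in for the real data file.
sample = ['Mon-000001 99.999 21.427\n', 'Mon-000002 99.999 20.905\n']
ids = pull_ids_flat(sample)
print(summarize(ids))
```

A file object is also an iterable of lines, so the same function can be called as pull_ids_flat(f) inside a with open(...) as f: block.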
answered Nov 29, 2013 at 12:08
\$\endgroup\$
  • \$\begingroup\$ Wow. Thank you so much for this. It isn't crucial that I print these IDs to my screen but I wanted to see if my script was working correctly and this crazy slow down kind of confused me. But, you've cleared all of that up. Thanks for the test-case pointers and the insight into Emacs. \$\endgroup\$ Commented Nov 29, 2013 at 19:21
\$\begingroup\$

If you know that the Mon-###### IDs will always be at the start of each line, there is no need for .findall(..); just extract the first 10 characters of the line (provided the Mon-###### ID is never longer than 10 characters).
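
A rough sketch of the slicing idea, assuming Python 3 syntax and that every data line begins with a 10-character ID. The function name pull_ids_by_slice and the startswith check (used to skip the "# Monid ..." header line) are my own additions for illustration:

```python
def pull_ids_by_slice(lines, width=10):
    """Return the first `width` characters of each line that starts with 'Mon-'."""
    ids = []
    for line in lines:
        # Skip the '# Monid ...' header and any other non-data lines.
        if line.startswith('Mon-'):
            ids.append(line[:width])
    return ids


# Made-up sample lines in the same layout as the data file.
sample = [
    '# Monid U B V\n',
    'Mon-000001 99.999 99.999 21.427\n',
    'Mon-000002 99.999 99.999 20.905\n',
]
print(pull_ids_by_slice(sample))
```

This only works when the ID really is at the start of the line; for files where it can appear elsewhere, a regular expression is still needed.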

answered Nov 28, 2013 at 8:02
\$\endgroup\$
  • \$\begingroup\$ If it isn't though would this be the only way? I only ask because I plan on using this on multiple files, some of which may not have the IDs in the first part of the line. Thanks. \$\endgroup\$ Commented Nov 28, 2013 at 8:10
  • \$\begingroup\$ Another option would be to change .findall(..) so that it only finds the first instance (if there is some kind of .find() or .findfirst()). The difference is that instead of traversing the entire line, it only traverses it until a match is found. It's a small improvement, but I guess it adds up. \$\endgroup\$ Commented Nov 28, 2013 at 8:20
  • \$\begingroup\$ @Max: Did you try any of these suggestions? How much improvement to the runtime did they make? \$\endgroup\$ Commented Nov 28, 2013 at 14:26
  • \$\begingroup\$ No, I didn't try it, and I don't expect a major improvement, but when you're after performance boosts I guess it all adds up. \$\endgroup\$ Commented Nov 28, 2013 at 14:29
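
For reference, the "find only the first instance" idea from the comments corresponds to re.search in Python's re module, which returns the first match in a string (or None). A minimal sketch, assuming Python 3 syntax and at most one ID of interest per line; the name pull_first_ids is hypothetical, and no timing claim is made here:

```python
import re

# Same pattern as in the question, compiled once for reuse.
MON_ID = re.compile(r'Mon-\d{6}')


def pull_first_ids(lines):
    """Return the first Mon-ID on each line, wherever it occurs in the line."""
    ids = []
    for line in lines:
        match = MON_ID.search(line)  # stops scanning at the first match
        if match:
            ids.append(match.group())
    return ids


# Made-up sample lines; the second one has the ID away from the line start.
sample = [
    '# Monid U B V\n',
    '99.999 Mon-000001 21.427 Mon-999999\n',
    'Mon-000002 99.999\n',
]
print(pull_first_ids(sample))
```

Unlike the slicing approach, this still works when the ID is not in the first part of the line, and it yields a flat list of strings rather than a list of lists.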
