4
\$\begingroup\$

I am using Python 2.6.5 and am trying to find the fastest way to print out the contents of a .gz file. My understanding is that before Python 2.5, piping through zcat was much faster than the gzip module (see here), though a comment on that post suggests this has since changed. I have decompressed a 2.4MB .gz file three ways, and each takes about 17 minutes. Is there a faster way?

This takes 17 minutes in Python:

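import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
f = open('/2.4MB.gz', 'rb')
buffer = f.read(1024)
while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(1024)
# Emit whatever is still held in the decompressor
outstr = d.flush()
print(outstr)
f.close()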

This also takes 17 minutes:

import gzip

# Read the whole decompressed file into memory, then print it once
f = gzip.open('/2.4MB.gz', 'rb')
file_content = f.read()
print file_content
f.close()

Again, 17 minutes:

from subprocess import Popen, PIPE

def gziplines(fname):
    # Let the external zcat process decompress, and stream its output by line
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line

fname = '/2.4MB.gz'
for line in gziplines(fname):
    print line,

My eventual goal is to take the contents of the .gz file and dump them directly into a MySQL database without printing the lines. The unzipped file is 15.8MB.

When I gunzip the file first and then use the csv module to print its contents to the screen, it takes 1 minute instead of 17, so printing in and of itself doesn't seem to be the problem.

Jamal
asked Feb 27, 2012 at 17:49
\$\endgroup\$
  • \$\begingroup\$ Are you profiling? I imagine that most of the 17 mins is spent printing to your screen. \$\endgroup\$ Commented Feb 27, 2012 at 17:54
  • \$\begingroup\$ How big is the uncompressed data? If it is huge, then the print line is probably taking up 99% of your execution time. \$\endgroup\$ Commented Feb 27, 2012 at 17:55
  • \$\begingroup\$ Also, the speed issues in pre-2.5 Python applied only to reading line by line, which you aren't doing. \$\endgroup\$ Commented Feb 27, 2012 at 18:03
  • \$\begingroup\$ Ok, that's interesting. Can you share the file in question? \$\endgroup\$ Commented Feb 28, 2012 at 21:32

1 Answer

6
\$\begingroup\$

As it stands, you are measuring the speed of printing. Printing is one of the slowest things your program will ever have to do. Take the prints out and measure again to find the actual decompression speed.
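One way to separate the two costs is to time decompression and output independently. The sketch below is only an illustration: it assumes a Python 3 interpreter rather than the 2.6.5 in the question, and the helper names `time_decompress` and `time_output` are made up for this example.

```python
import gzip
import io
import time


def time_decompress(path):
    """Decompress path fully in memory; return (seconds, data) with no printing."""
    start = time.perf_counter()
    with gzip.open(path, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, data


def time_output(data, stream):
    """Return the seconds spent writing already-decompressed bytes to stream."""
    start = time.perf_counter()
    stream.write(data)
    stream.flush()
    return time.perf_counter() - start
```

Calling `time_decompress('/2.4MB.gz')` and then `time_output(data, sys.stdout.buffer)` (after `import sys`) should show which of the two steps accounts for the 17 minutes.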

Your middle method (gzip.open followed by a single read()) will almost certainly be the fastest.
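Since the stated goal is to load the rows into a database rather than print them, the gzip.open approach can feed the csv module directly, with no intermediate file and no screen output. This is a rough sketch under several assumptions: Python 3, a UTF-8 encoded CSV file, and a hypothetical helper `iter_row_batches` with an arbitrary default batch size. The database call itself is left out.

```python
import csv
import gzip


def iter_row_batches(path, batch_size=1000):
    """Yield lists of CSV rows read straight from a gzip-compressed file."""
    # "rt" makes gzip decode bytes to text so csv.reader can consume it.
    # The utf-8 encoding is an assumption about the file, not a known fact.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            # Hand over the final, possibly shorter, batch
            yield batch
```

Each yielded batch could then be passed to a database driver's bulk insert, for example the DB-API `cursor.executemany`, so that nothing is ever written to the terminal.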

answered Feb 28, 2012 at 1:22
\$\endgroup\$
