
I have created a Python system that runs Linux core files through the crash debugger with some Python extensions. This all works fine, but one bit of it is problematic.

These files are sent to the system in gzip format, each consisting of a single huge data file. The compressed file can often be as big as 20 GB. The unzipping works but is very slow and uses huge amounts of memory. As an example, last night the system processed a 14 GB gzip file: it took 9.2 hours to uncompress (60 GB uncompressed), and memory utilisation hovered around 30 GB, peaking at 60 GB.

I am starting to think that perhaps my code is the cause.

def chk_gzip_file(FILE):
    logger.info("Will write uncompressed file to: " + COREDIR)
    if os.path.isdir(FILE) == False:
        inF = gzip.open(FILE, 'rb')
        s = inF.read()
        inF.close()
        gzip_fname = os.path.basename(FILE)
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(COREDIR, fname)
        open(uncompressed_path, 'w').write(s)
        uncompressedfile = COREDIR + "/" + fname
        return uncompressedfile
    else:
        logger.critical("No gz file found : " + FILE)
        sys.exit()

I am not a programmer, so I imagine this is fairly poor code. Can it be improved for huge files? I know that speed will be an issue, as gzip decompression is single-threaded.

200_success
asked Feb 22, 2017 at 9:57
  • I don't think the if should be indented that much; can you double-check that the code above is exactly the same as in your IDE? (Commented Feb 22, 2017 at 10:17)
  • @PeterTaylor You shouldn't edit code in questions: Should you edit someone else's code in a question? (Commented Feb 22, 2017 at 10:34)
  • @Peilonrayz, read the answer meta.codereview.stackexchange.com/a/1816/1402 in the thread you reference. The OP has clearly stated that the indentation broke when posting, and doesn't have the ability to edit to correct it himself. (Commented Feb 22, 2017 at 10:36)

1 Answer
 inF = gzip.open(FILE, 'rb')
 s = inF.read()
 inF.close()

That reads the entire uncompressed data into memory at once. Of course it takes 60 GB of memory.

Looking at the documentation for gzip, it has this example of compressing a file:

import gzip
import shutil
with open('file.txt', 'rb') as f_in, gzip.open('file.txt.gz', 'wb') as f_out:
 shutil.copyfileobj(f_in, f_out)

If you switch that round to:

import gzip
import shutil
with open('file.txt', 'wb') as f_out, gzip.open('file.txt.gz', 'rb') as f_in:
 shutil.copyfileobj(f_in, f_out)

then I think you'll find the memory usage is much lower.
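Putting that together, here is one possible streaming rewrite of the original function. This is only a sketch: I've made `COREDIR` a parameter (`coredir`) rather than a global, and the name changes are illustrative, not part of the original code.

```python
import gzip
import logging
import os
import shutil
import sys

logger = logging.getLogger(__name__)

def chk_gzip_file(path, coredir):
    """Stream-decompress a .gz file into coredir; return the output path.

    Sketch only: `coredir` is assumed to be passed in instead of the
    original global COREDIR.
    """
    if os.path.isdir(path):
        logger.critical("No gz file found: %s", path)
        sys.exit(1)
    fname = os.path.basename(path)[:-3]  # strip the ".gz" suffix
    uncompressed_path = os.path.join(coredir, fname)
    logger.info("Will write uncompressed file to: %s", coredir)
    with gzip.open(path, 'rb') as f_in, open(uncompressed_path, 'wb') as f_out:
        # copyfileobj copies in fixed-size chunks, so memory use stays
        # flat regardless of how large the decompressed data is.
        shutil.copyfileobj(f_in, f_out)
    return uncompressed_path
```

Note the output file is opened in `'wb'` mode: the decompressed bytes are binary, and writing them through a text-mode handle would fail on Python 3.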

answered Feb 22, 2017 at 10:29
