I have created a Python system that runs Linux core files through the crash debugger with some Python extensions. This all works fine, but one bit of it is problematic.
These files are sent to the system in gzip format and each consists of a single huge data file. The compressed file can often be as big as 20G. The unzipping works, but it is very slow and often uses huge amounts of memory. As an example, last night the system processed a 14G gzip file: it took 9.2 hours to uncompress it (60G uncompressed), and the memory utilisation hovered around 30G, peaking at 60G.
Starting to think perhaps my code is the cause.
def chk_gzip_file(FILE):
    logger.info ("Will write uncompressed file to: "+ COREDIR)
    if os.path.isdir(FILE) == False:
        inF = gzip.open(FILE, 'rb')
        s = inF.read()
        inF.close()
        gzip_fname = os.path.basename(FILE)
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(COREDIR, fname)
        open(uncompressed_path, 'w').write(s)
        uncompressedfile=COREDIR+"/"+fname
        return uncompressedfile
    else:
        logger.critical ("No gz file found : " + FILE)
        sys.exit()
I am not a programmer, so I imagine this is fairly poor code. Can it be improved for huge files? I know that speed will be an issue, as gzip decompression is single-threaded.
1 Answer
inF = gzip.open(FILE, 'rb')
s = inF.read()
inF.close()
That reads the whole uncompressed data into memory. Of course it takes 60GB of memory.
The documentation for gzip has this example of compressing a file:
import gzip
import shutil
with open('file.txt', 'rb') as f_in, gzip.open('file.txt.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
If you switch that round to:
import gzip
import shutil
with open('file.txt', 'wb') as f_out, gzip.open('file.txt.gz', 'rb') as f_in:
    shutil.copyfileobj(f_in, f_out)
then I think you'll find the memory usage is much lower.
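Putting that together with your function, a reworked version might look something like the sketch below. This is only a sketch, assuming logger and COREDIR are defined at module level as in your original code; it also opens the output file in binary mode ('wb'), which is what you want for binary core data (your current code uses 'w').

import gzip
import os
import shutil
import sys

def chk_gzip_file(FILE):
    # Treat anything that is not a directory as a gzip file, as in the original.
    if os.path.isdir(FILE):
        logger.critical("No gz file found : " + FILE)
        sys.exit()
    gzip_fname = os.path.basename(FILE)
    fname = gzip_fname[:-3]  # strip the ".gz" suffix
    uncompressed_path = os.path.join(COREDIR, fname)
    logger.info("Will write uncompressed file to: " + uncompressed_path)
    # Stream the decompressed data to disk in chunks instead of holding
    # the whole uncompressed file in memory at once.
    with gzip.open(FILE, 'rb') as f_in, open(uncompressed_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return uncompressed_path

shutil.copyfileobj copies the data in fixed-size chunks rather than reading everything first, so memory use stays roughly constant no matter how big the core file is. It also accepts an optional chunk size as a third argument, e.g. shutil.copyfileobj(f_in, f_out, 1024 * 1024), which sometimes helps throughput on very large files, but it will not change the fact that the decompression itself runs on a single core.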
Are you sure the if in chk_gzip_file should be indented that much? Can you double-check that the code in the question is exactly the same as what is in your IDE?