I have created a Python system that runs Linux core files through the crash debugger with some Python extensions. This all works fine, but one bit of it is problematic.
These files are sent to the system in gzip format and each consists of a single huge data file. The compressed file can often be as big as 20G. The unzipping works, but it is very slow and often uses huge amounts of memory. As an example, last night the system processed a 14G gzip file: it took 9.2 hours to uncompress it (60G uncompressed), and the memory utilisation hovered around 30G, peaking at 60G.
Starting to think perhaps my code is the cause.
def chk_gzip_file(FILE):
    logger.info ("Will write uncompressed file to: "+ COREDIR)
    if os.path.isdir(FILE) == False:
        inF = gzip.open(FILE, 'rb')
        s = inF.read()
        inF.close()
        gzip_fname = os.path.basename(FILE)
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(COREDIR, fname)
        open(uncompressed_path, 'w').write(s)
        uncompressedfile=COREDIR+"/"+fname
        return uncompressedfile
    else:
        logger.critical ("No gz file found : " + FILE)
        sys.exit()
I am not a programmer, so I imagine this is fairly poor code. Can it be improved for huge files? I know that speed will be an issue, as gzip decompression is single-threaded.
1 Answer
inF = gzip.open(FILE, 'rb')
s = inF.read()
inF.close()
That reads the whole uncompressed data into memory. Of course it takes 60GB of memory.
The documentation for gzip has this example of compressing a file:
import gzip
import shutil
with open('file.txt', 'rb') as f_in, gzip.open('file.txt.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
If you switch that round to:
import gzip
import shutil
with open('file.txt', 'wb') as f_out, gzip.open('file.txt.gz', 'rb') as f_in:
    shutil.copyfileobj(f_in, f_out)
then I think you'll find the memory usage is much lower.
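Putting that together with your function, a reworked version might look something like the sketch below. This is only a sketch, assuming logger and COREDIR are defined at module level as in your original code; it also opens the output file in binary mode ('wb'), which is what you want for binary core data (your current code uses 'w').

import gzip
import os
import shutil
import sys

def chk_gzip_file(FILE):
    # Treat anything that is not a directory as a gzip file, as in the original.
    if os.path.isdir(FILE):
        logger.critical("No gz file found : " + FILE)
        sys.exit()
    gzip_fname = os.path.basename(FILE)
    fname = gzip_fname[:-3]  # strip the ".gz" suffix
    uncompressed_path = os.path.join(COREDIR, fname)
    logger.info("Will write uncompressed file to: " + uncompressed_path)
    # Stream the decompressed data to disk in chunks instead of holding
    # the whole uncompressed file in memory at once.
    with gzip.open(FILE, 'rb') as f_in, open(uncompressed_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return uncompressed_path

shutil.copyfileobj copies the data in fixed-size chunks rather than reading everything first, so memory use stays roughly constant no matter how big the core file is. It also accepts an optional chunk size as a third argument, e.g. shutil.copyfileobj(f_in, f_out, 1024 * 1024), which sometimes helps throughput on very large files, but it will not change the fact that the decompression itself runs on a single core.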
Are you sure the if in chk_gzip_file should be indented that much? Can you double-check that the code in the question is exactly the same as what is in your IDE?