Adding information to a compressed file and compressing the output

Question 1

I wrote this script for adding information to a compressed file and compressing the output:

#!/bin/env python3
import argparse
from Bio.bgzf import BgzfWriter
import gzip
import sys
def main():
 args = parse_args()
# with BgzfWriter(args.out, 'wb') as fd_out:
 with gzip.open(args.out, 'wt') as fd_out:
 for i_vcf, vcf in enumerate(args.vcf):
 with gzip.open(vcf, 'rt') as fd_vcf:
 ## loop metadata lines and header line
 for line in fd_vcf:
 ## end of metadata lines
 if line[0:2] != '##':
 ## print header line
 print(line, end='', file=fd_out)
 ## break loop at the header line
 break
 if i_vcf == 0:
 ## print metadata line
 print(line, end='', file=fd_out)
 ## loop records
 for line in fd_vcf:
 l = line.rstrip().split('\t')
 s = '\t'.join(l[:9])+':GL'
 i_PL = l[8].split(':').index('PL')
 print(s, sep='\t', file=fd_out, end='\t')
 for i_GT in range(9, len(l)):
 s_PL = l[i_GT].split(':')[i_PL]
 if s_PL == '.':
 s_GL = '.'
 else:
# l_probs = [
# pow(10, -int(log10likelihood)/10) for log10likelihood in s_PL.split(',')]
# sum_prob = sum(l_probs)
# s_GL = ','.join(map(str, (round(prob/sum_prob, 4) for prob in l_probs)))
 s_GL = ','.join(str(-float(PL)/10) for PL in s_PL.split(','))
 print(':'.join((l[i_GT], s_GL)), end='\t', file=fd_out)
 print(file=fd_out)
 return
def parse_args():
 parser = argparse.ArgumentParser()
 parser.add_argument('--vcf', nargs = '+')
 parser.add_argument('--out', required=True)
 args = parser.parse_args()
 return args
if __name__ == '__main__':
 main()

Here is an example of an input file:

 fileformat=VCFv4.1
 ##FILTER=<ID=LowQual,Description="Low quality">
 ##FILTER=<ID=VQSRTrancheINDEL99.00to99.50,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.0597 <= x < 0.1687">
 ##FILTER=<ID=VQSRTrancheINDEL99.50to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -5.8589 <= x < -1.0597">
 ##FILTER=<ID=VQSRTrancheINDEL99.90to99.95,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -6.5565 <= x < -5.8589">
 ##FILTER=<ID=VQSRTrancheINDEL99.95to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -70.77">
 ##FILTER=<ID=VQSRTrancheINDEL99.95to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -70.77 <= x < -6.5565">
 ##FILTER=<ID=VQSRTrancheSNP99.90to99.95,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -96.2455 <= x < -14.2763">
 ##FILTER=<ID=VQSRTrancheSNP99.95to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -437.9355">
 ##FILTER=<ID=VQSRTrancheSNP99.95to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -437.9355 <= x < -96.2455">
 ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
 ...

The gzip module seems somewhat slow in 'rt' mode. Is there any way I can make this run faster? I've run cProfile on it, and after letting it run for ~35 seconds, I got this output:

 ncalls tottime percall cumtime percall filename:lineno(function)
 967857 4.333 0.000 15.627 0.000 {built-in method builtins.print}
 1936445 4.534 0.000 11.917 0.000 {method 'join' of 'str' objects}
 4528 0.049 0.000 9.982 0.002 gzip.py:328(write)
 4527 9.853 0.002 9.853 0.002 {method 'compress' of 'zlib.Compress' objects}
 3950458 7.382 0.000 7.382 0.000 PL2GL.py:42(<genexpr>)
 1986270 2.012 0.000 2.012 0.000 {method 'split' of 'str' objects}

Here is a small input file on which it will work:

 ##foo
 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT x1 x2 x3
 18 10310 . A C,G 9.22 . bar GT:AD:DP:GQ:PL 0/0:2,1,0:3:6:0,6,55,6,55,55 0/0:11,0,0:12:27:0,27,249,27,249,249 0/0:6,0,0:6:15:0,15,140,15,140,140

Question 2

File mode rt for text instead of rb / r for binary does a text conversion of the bytes which makes sense for header lines. But you can also convert the bytes yourself, just on the header lines. Test all with special Unicode symbols. And yes the conversion takes time especially for UTF-8.

Question 3

UX

When I run the code with the -h option, I see there are 2 options to run the code (--vcf and --out). But, I don't really know what they mean. You should use the help parameter for the add_argument function to provide more information to the user. If vcf is an acronym, spell it out.

It also seems like the --vcf option is required.

Comments

Remove all commented-out code to reduce clutter:

# with BgzfWriter(args.out, 'wb') as fd_out:

This line of code is also not needed and can be removed:

from Bio.bgzf import BgzfWriter

Remove these lines as well:

# l_probs = [
# pow(10, -int(log10likelihood)/10) for log10likelihood in s_PL.split(',')]
# sum_prob = sum(l_probs)
# s_GL = ','.join(map(str, (round(prob/sum_prob, 4) for prob in l_probs)))

Naming

Lower-case l is a bad name for a variable because it is easily confused with the number 1, not to mention that it conveys little meaning:

l = line.rstrip().split('\t')

Perhaps something more meaningful such as:

columns = line.rstrip().split('\t')

These variables could also have better names: s, s_PL, s_GL, etc.

The function name main is too generic. Perhaps this is more meaningful:

def compress_files():

Documentation

The PEP 8 style guide recommends adding docstrings for functions.

Tools

You could run code development tools to automatically find some style issues with your code.

ruff finds this:

= help: Remove unused import: `sys`

toolic toolic 14.5k5 gold badges29 silver badges203 bronze badges · Answer 1 · 2025-03-17 12:10:47Z

UX

When I run the code with the -h option, I see there are 2 options to run the code (--vcf and --out). But, I don't really know what they mean. You should use the help parameter for the add_argument function to provide more information to the user. If vcf is an acronym, spell it out.

It also seems like the --vcf option is required.

Comments

Remove all commented-out code to reduce clutter:

# with BgzfWriter(args.out, 'wb') as fd_out:

This line of code is also not needed and can be removed:

from Bio.bgzf import BgzfWriter

Remove these lines as well:

# l_probs = [
# pow(10, -int(log10likelihood)/10) for log10likelihood in s_PL.split(',')]
# sum_prob = sum(l_probs)
# s_GL = ','.join(map(str, (round(prob/sum_prob, 4) for prob in l_probs)))

Naming

Lower-case l is a bad name for a variable because it is easily confused with the number 1, not to mention that it conveys little meaning:

l = line.rstrip().split('\t')

Perhaps something more meaningful such as:

columns = line.rstrip().split('\t')

These variables could also have better names: s, s_PL, s_GL, etc.

The function name main is too generic. Perhaps this is more meaningful:

def compress_files():

Documentation

The PEP 8 style guide recommends adding docstrings for functions.

Tools

You could run code development tools to automatically find some style issues with your code.

ruff finds this:

= help: Remove unused import: `sys`

Stack Exchange Network

Adding information to a compressed file and compressing the output

1 Answer 1

UX

Comments

Naming

Documentation

Tools

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Adding information to a compressed file and compressing the output

1 Answer 1

UX

Comments

Naming

Documentation

Tools

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions