I wrote this script for adding information to a compressed file and compressing the output:
#!/bin/env python3
import argparse
from Bio.bgzf import BgzfWriter
import gzip
import sys
def main():
args = parse_args()
# with BgzfWriter(args.out, 'wb') as fd_out:
with gzip.open(args.out, 'wt') as fd_out:
for i_vcf, vcf in enumerate(args.vcf):
with gzip.open(vcf, 'rt') as fd_vcf:
## loop metadata lines and header line
for line in fd_vcf:
## end of metadata lines
if line[0:2] != '##':
## print header line
print(line, end='', file=fd_out)
## break loop at the header line
break
if i_vcf == 0:
## print metadata line
print(line, end='', file=fd_out)
## loop records
for line in fd_vcf:
l = line.rstrip().split('\t')
s = '\t'.join(l[:9])+':GL'
i_PL = l[8].split(':').index('PL')
print(s, sep='\t', file=fd_out, end='\t')
for i_GT in range(9, len(l)):
s_PL = l[i_GT].split(':')[i_PL]
if s_PL == '.':
s_GL = '.'
else:
# l_probs = [
# pow(10, -int(log10likelihood)/10) for log10likelihood in s_PL.split(',')]
# sum_prob = sum(l_probs)
# s_GL = ','.join(map(str, (round(prob/sum_prob, 4) for prob in l_probs)))
s_GL = ','.join(str(-float(PL)/10) for PL in s_PL.split(','))
print(':'.join((l[i_GT], s_GL)), end='\t', file=fd_out)
print(file=fd_out)
return
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--vcf', nargs = '+')
parser.add_argument('--out', required=True)
args = parser.parse_args()
return args
if __name__ == '__main__':
main()
Here is an example of an input file:
fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=VQSRTrancheINDEL99.00to99.50,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.0597 <= x < 0.1687">
##FILTER=<ID=VQSRTrancheINDEL99.50to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -5.8589 <= x < -1.0597">
##FILTER=<ID=VQSRTrancheINDEL99.90to99.95,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -6.5565 <= x < -5.8589">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -70.77">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -70.77 <= x < -6.5565">
##FILTER=<ID=VQSRTrancheSNP99.90to99.95,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -96.2455 <= x < -14.2763">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -437.9355">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -437.9355 <= x < -96.2455">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
...
The gzip
module seems somewhat slow in 'rt' mode. Is there any way I can make this run faster? I've run cProfile on it, and after letting it run for ~35 seconds, I got this output:
ncalls tottime percall cumtime percall filename:lineno(function)
967857 4.333 0.000 15.627 0.000 {built-in method builtins.print}
1936445 4.534 0.000 11.917 0.000 {method 'join' of 'str' objects}
4528 0.049 0.000 9.982 0.002 gzip.py:328(write)
4527 9.853 0.002 9.853 0.002 {method 'compress' of 'zlib.Compress' objects}
3950458 7.382 0.000 7.382 0.000 PL2GL.py:42(<genexpr>)
1986270 2.012 0.000 2.012 0.000 {method 'split' of 'str' objects}
Here is a small input file on which it will work:
##foo
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT x1 x2 x3
18 10310 . A C,G 9.22 . bar GT:AD:DP:GQ:PL 0/0:2,1,0:3:6:0,6,55,6,55,55 0/0:11,0,0:12:27:0,27,249,27,249,249 0/0:6,0,0:6:15:0,15,140,15,140,140
1 Answer 1
UX
When I run the code with the -h
option, I see there are 2 options
to run the code (--vcf
and --out
). But, I don't really know
what they mean. You should use the help
parameter for the add_argument
function to provide more information to the user. If vcf
is an acronym,
spell it out.
It also seems like the --vcf
option is required.
Comments
Remove all commented-out code to reduce clutter:
# with BgzfWriter(args.out, 'wb') as fd_out:
This line of code is also not needed and can be removed:
from Bio.bgzf import BgzfWriter
Remove these lines as well:
# l_probs = [
# pow(10, -int(log10likelihood)/10) for log10likelihood in s_PL.split(',')]
# sum_prob = sum(l_probs)
# s_GL = ','.join(map(str, (round(prob/sum_prob, 4) for prob in l_probs)))
Naming
Lower-case l
is a bad name for a variable because it is easily confused
with the number 1
, not to mention that it conveys little meaning:
l = line.rstrip().split('\t')
Perhaps something more meaningful such as:
columns = line.rstrip().split('\t')
These variables could also have better names: s
, s_PL
, s_GL
, etc.
The function name main
is too generic. Perhaps this is more meaningful:
def compress_files():
Documentation
The PEP 8 style guide recommends adding docstrings for functions.
Tools
You could run code development tools to automatically find some style issues with your code.
ruff
finds this:
= help: Remove unused import: `sys`
Explore related questions
See similar questions with these tags.
rt
for text instead ofrb
/r
for binary does a text conversion of the bytes which makes sense for header lines. But you can also convert the bytes yourself, just on the header lines. Test all with special Unicode symbols. And yes the conversion takes time especially for UTF-8. \$\endgroup\$