This script of mine merges columns 1 and 2 from one input file and sees if these merged combinations exist in the other infile (and vice versa). I know I get stuck in appending. It did not get past that while running all night. The infiles are 13,000,000 lines long.
#!/usr/bin/env python
infile1=open('/path/to/tab/delimited/file1.txt', 'r')
comb1=[]
for Lina in infile1:
if Lina.startswith('##'):
continue
Lina=Lina.strip('\n')
tmp=Lina.split('\t')
var=str(tmp[0])+str(tmp[1])
if var not in comb1:
comb1.append(var)
print "...finished first file..."
print combi1
infile2=open('/path/to/tab/delimited/file2.txt', 'r')
comb2=[]
for Lina in infile2:
if Lina.startswith('##'):
continue
Lina=Lina.strip('\n')
tmp=Lina.split('\t')
var=str(tmp[0])+str(tmp[1])
if var not in comb2:
comb2.append(i)
print "...finished second infile..."
mismatch1=0
for i in comb2:
if i not in comb1:
mismatch1+=1
print "...finished first compairing..."
mismatch2=0
for i in comb1:
if i not in comb2:
mismatch2+=1
print "Length of popRaw is ", len(comb1)
print "Length of OC is ", len(comb2)
print mismatch1, " of OC are not in popRaw"
print mismatch2, " of popRaw are not in OC"
I think I have done similar things with an AWK one-liner and I think this is possible in one clause in python. It would be nice to learn new pythonic or other things but if there is an obvious error please tell me.
infiles looks like this after a large header of lines starting with ##:
N00001 555 . C T 30.33 . AC=1;AF=0.045;AN=22;BaseQRankSum=0.684;ClippingRankSum=-3.220e-01;DP=87;FS=0.000;GQ_MEAN=26.18;GQ_STDDEV
N00001 564 . C T 1724.25 . AC=10;AF=0.455;AN=22;BaseQRankSum=0.634;ClippingRankSum=-2.480e-01;DP=129;FS=0.000;GQ_MEAN=71.73;GQ_STDD
N00001 648 . A G 2097.88 . AC=12;AF=0.545;AN=22;BaseQRankSum=-8.180e-01;ClippingRankSum=0.00;DP=148;FS=0.000;GQ_MEAN=90.82;GQ_STDDE
N00001 836 . G A 3432.79 . AC=9;AF=0.409;AN=22;BaseQRankSum=2.96;ClippingRankSum=0.088;DP=288;FS=1.652;GQ_MEAN=147.64;GQ_STDDEV=83.
N00001 875 . C T 2461.02 . AC=6;AF=0.273;AN=22;BaseQRankSum=1.29;ClippingRankSum=0.167;DP=360;FS=1.608;GQ_MEAN=237.27;GQ_STDDEV=169
N00001 966 . A C 2328.53 . AC=5;AF=0.227;AN=22;BaseQRankSum=-1.572e+00;ClippingRankSum=-6.600e-01;DP=382;FS=2.294;GQ_MEAN=233.55;GQ
N00001 972 . G T 1502.09 . AC=4;AF=0.182;AN=22;BaseQRankSum=-2.550e-01;ClippingRankSum=-1.710e-01;DP=344;FS=0.576;GQ_MEAN=158.91;GQ
N00001 1014 . T C 375.27 . AC=1;AF=0.045;AN=22;BaseQRankSum=-2.917e+00;ClippingRankSum=0.517;DP=295;FS=0.000;GQ_MEAN=83.91;GQ_STDDE
N00001 1029 . A G 1825.66 . AC=10;AF=0.455;AN=22;BaseQRankSum=-3.418e+00;ClippingRankSum=0.149;DP=304;FS=3.241;GQ_MEAN=150.18;GQ_STD
N00001 1174 . C A 12316.07 . AC=20;AF=0.909;AN=22;BaseQRankSum=-6.910e-01;ClippingRankSum=0.829;DB;DP=452;FS=0.000;GQ_MEAN=17
And this
N00001 256 . G T 30.03 VQSRTrancheSNP90.00to99.00 AC=4;AF=0.667;AN=6;BaseQRankSum=0.736;DP=4;Dels=0.00;FS=0.000;HaplotypeScore=0.0
N00001 257 . G T 31.03 VQSRTrancheSNP90.00to99.00 AC=4;AF=0.667;AN=6;BaseQRankSum=-0.736;DP=4;Dels=0.00;FS=0.000;HaplotypeScore=0.
N00001 836 . G A 1265.57 VQSRTrancheSNP90.00to99.00 AC=6;AF=0.150;AN=40;BaseQRankSum=9.578;DP=385;Dels=0.00;FS=8.949;HaplotypeScore=
N00001 966 . A C 298.33 VQSRTrancheSNP90.00to99.00 AC=1;AF=0.025;AN=40;BaseQRankSum=-0.224;DP=388;Dels=0.00;FS=0.000;HaplotypeScore
N00001 1174 . C A 4566.80 PASS AC=23;AF=0.575;AN=40;BaseQRankSum=-5.661;DP=409;Dels=0.00;FS=3.770;HaplotypeScore=0.2233;InbreedingCoeff
N00001 1219 . T C 1748.80 VQSRTrancheSNP90.00to99.00 AC=7;AF=0.175;AN=40;BaseQRankSum=-1.403;DP=452;Dels=0.00;FS=2.151;HaplotypeScore
N00001 1234 . T C 1449.56 VQSRTrancheSNP90.00to99.00 AC=6;AF=0.150;AN=40;BaseQRankSum=-0.988;DP=446;Dels=0.00;FS=2.264;HaplotypeScore
N00001 1258 . G T 1450.74 VQSRTrancheSNP90.00to99.00 AC=7;AF=0.175;AN=40;BaseQRankSum=-3.583;DP=419;Dels=0.00;FS=1.054;HaplotypeScore
N00001 1260 . C T 3836 PASS AC=14;AF=0.350;AN=40;BaseQRankSum=10.337;DP=415;Dels=0.00;FS=2.211;HaplotypeScore=3.6986;InbreedingCoeff
N00001 1357 . G A 3660.33 VQSRTrancheSNP90.00to99.00 AC=14;AF=0.350;AN=40;BaseQRankSum=0.859;DP=365;Dels=0.00;FS=13.926;HaplotypeScor
-
\$\begingroup\$ I tried printing out ´var´ in the first loop and it was all right (and fast to get there) but it was still in the next line when I break it with ctrl+C after many hours. \$\endgroup\$AWE– AWE2014年09月02日 08:10:52 +00:00Commented Sep 2, 2014 at 8:10
-
\$\begingroup\$ There were originally two problems with this question. I've edited and reopened it. 1) Golfing is off-topic (see help center). 2) Asking for an AWK script while presenting a Python solution is not asking for a code review; it's asking for code to be written, which if off-topic. (The [awk] tag was inappropriate.) \$\endgroup\$200_success– 200_success2014年09月02日 08:45:54 +00:00Commented Sep 2, 2014 at 8:45
-
\$\begingroup\$ Great, thank you. I often rock the boat on S.E sites without intending to. \$\endgroup\$AWE– AWE2014年09月02日 09:00:53 +00:00Commented Sep 2, 2014 at 9:00
1 Answer 1
There are a number of issues with this program. I'll start with some trivial ones first.
- Hard-coding the input filenames makes it inflexible throwaway code. I suggest taking command-line parameters (
sys.argv
). - Variable names
tmp
andvar
are horribly uninformative. Also,Lina
violates PEP 8 capitalization guidelines. - PEP 8 also calls for four spaces of indentation. Since whitespace is significant in Python, this is a strong convention.
- Casting to
str()
is superfluous. - Every line in the program is repeated in some form. Cut-and-paste programming is poor practice. Use functions to avoid such duplication.
- You opened two files without closing them, which is a file descriptor leak.
- Consider using the
csv
module.
The big issue, though, is that the list is not an appropriate data structure. Searching for an entry in an array is O(n); doing so in a nested loop would be O(n2).
What you want is a set. Finding an entry in a dictionary or a set is O(1). As a bonus, operations such as set difference are already implemented for you.
import csv
import sys
def summarize(filename):
with open(filename) as f:
reader = csv.reader(f, delimiter="\t")
return set(field[0] + field[1] for field in reader if not field[0].startswith('#'))
# Two filename parameters expected on the command line
file1, file2 = sys.argv[1:]
comb1 = summarize(file1)
print 'Finished reading %s' % (file1)
comb2 = summarize(file2)
print 'Finished reading %s' % (file2)
print '%d distinct entries in %s' % (len(comb1), file1)
print '%d distinct entries in %s' % (len(comb2), file2)
print '%d of %s are not in %s' % (len(comb2 - comb1), file2, file1)
print '%d of %s are not in %s' % (len(comb1 - comb2), file1, file2)
Caveat: If the same entry (first two columns) reappears within the same file, this script only treats them as one distinct entry.
-
\$\begingroup\$ Yes it was sloppy but
set
saved the day,set intersection
looks like a good tool \$\endgroup\$AWE– AWE2014年09月02日 11:51:48 +00:00Commented Sep 2, 2014 at 11:51 -
\$\begingroup\$ Just curious... how long does it take to run now? \$\endgroup\$200_success– 200_success2014年09月02日 11:56:46 +00:00Commented Sep 2, 2014 at 11:56
-
\$\begingroup\$ Just over 4 min:
13945665 distinct entries in iii, 12723411 distinct entries in jjj, 4702437 of jjj are not in iii, 5924691 of iii are not in JJJ, 8020974 entries are in both files
\$\endgroup\$AWE– AWE2014年09月02日 12:17:19 +00:00Commented Sep 2, 2014 at 12:17 -
\$\begingroup\$ For Python 2 it's better to open csv files using
rb
. \$\endgroup\$DSM– DSM2014年09月03日 04:32:10 +00:00Commented Sep 3, 2014 at 4:32 -
\$\begingroup\$ @DSM Please explain why? \$\endgroup\$200_success– 200_success2014年09月03日 04:35:40 +00:00Commented Sep 3, 2014 at 4:35
Explore related questions
See similar questions with these tags.