Slow Python text-processing script

Question 1

This script of mine merges columns 1 and 2 from one input file and sees if these merged combinations exist in the other infile (and vice versa). I know I get stuck in appending. It did not get past that while running all night. The infiles are 13,000,000 lines long.

#!/usr/bin/env python
infile1=open('/path/to/tab/delimited/file1.txt', 'r')
comb1=[]
for Lina in infile1:
 if Lina.startswith('##'):
 continue
 Lina=Lina.strip('\n')
 tmp=Lina.split('\t')
 var=str(tmp[0])+str(tmp[1])
 if var not in comb1:
 comb1.append(var)
print "...finished first file..."
print combi1
infile2=open('/path/to/tab/delimited/file2.txt', 'r')
comb2=[]
for Lina in infile2:
 if Lina.startswith('##'):
 continue
 Lina=Lina.strip('\n')
 tmp=Lina.split('\t')
 var=str(tmp[0])+str(tmp[1])
 if var not in comb2:
 comb2.append(i)
print "...finished second infile..."
mismatch1=0
for i in comb2:
 if i not in comb1:
 mismatch1+=1
print "...finished first compairing..."
mismatch2=0
for i in comb1:
 if i not in comb2:
 mismatch2+=1
print "Length of popRaw is ", len(comb1)
print "Length of OC is ", len(comb2)
print mismatch1, " of OC are not in popRaw"
print mismatch2, " of popRaw are not in OC"

I think I have done similar things with an AWK one-liner and I think this is possible in one clause in python. It would be nice to learn new pythonic or other things but if there is an obvious error please tell me.

infiles looks like this after a large header of lines starting with ##:

N00001 555 . C T 30.33 . AC=1;AF=0.045;AN=22;BaseQRankSum=0.684;ClippingRankSum=-3.220e-01;DP=87;FS=0.000;GQ_MEAN=26.18;GQ_STDDEV
N00001 564 . C T 1724.25 . AC=10;AF=0.455;AN=22;BaseQRankSum=0.634;ClippingRankSum=-2.480e-01;DP=129;FS=0.000;GQ_MEAN=71.73;GQ_STDD
N00001 648 . A G 2097.88 . AC=12;AF=0.545;AN=22;BaseQRankSum=-8.180e-01;ClippingRankSum=0.00;DP=148;FS=0.000;GQ_MEAN=90.82;GQ_STDDE
N00001 836 . G A 3432.79 . AC=9;AF=0.409;AN=22;BaseQRankSum=2.96;ClippingRankSum=0.088;DP=288;FS=1.652;GQ_MEAN=147.64;GQ_STDDEV=83.
N00001 875 . C T 2461.02 . AC=6;AF=0.273;AN=22;BaseQRankSum=1.29;ClippingRankSum=0.167;DP=360;FS=1.608;GQ_MEAN=237.27;GQ_STDDEV=169
N00001 966 . A C 2328.53 . AC=5;AF=0.227;AN=22;BaseQRankSum=-1.572e+00;ClippingRankSum=-6.600e-01;DP=382;FS=2.294;GQ_MEAN=233.55;GQ
N00001 972 . G T 1502.09 . AC=4;AF=0.182;AN=22;BaseQRankSum=-2.550e-01;ClippingRankSum=-1.710e-01;DP=344;FS=0.576;GQ_MEAN=158.91;GQ
N00001 1014 . T C 375.27 . AC=1;AF=0.045;AN=22;BaseQRankSum=-2.917e+00;ClippingRankSum=0.517;DP=295;FS=0.000;GQ_MEAN=83.91;GQ_STDDE
N00001 1029 . A G 1825.66 . AC=10;AF=0.455;AN=22;BaseQRankSum=-3.418e+00;ClippingRankSum=0.149;DP=304;FS=3.241;GQ_MEAN=150.18;GQ_STD
N00001 1174 . C A 12316.07 . AC=20;AF=0.909;AN=22;BaseQRankSum=-6.910e-01;ClippingRankSum=0.829;DB;DP=452;FS=0.000;GQ_MEAN=17

And this

N00001 256 . G T 30.03 VQSRTrancheSNP90.00to99.00 AC=4;AF=0.667;AN=6;BaseQRankSum=0.736;DP=4;Dels=0.00;FS=0.000;HaplotypeScore=0.0
N00001 257 . G T 31.03 VQSRTrancheSNP90.00to99.00 AC=4;AF=0.667;AN=6;BaseQRankSum=-0.736;DP=4;Dels=0.00;FS=0.000;HaplotypeScore=0.
N00001 836 . G A 1265.57 VQSRTrancheSNP90.00to99.00 AC=6;AF=0.150;AN=40;BaseQRankSum=9.578;DP=385;Dels=0.00;FS=8.949;HaplotypeScore=
N00001 966 . A C 298.33 VQSRTrancheSNP90.00to99.00 AC=1;AF=0.025;AN=40;BaseQRankSum=-0.224;DP=388;Dels=0.00;FS=0.000;HaplotypeScore
N00001 1174 . C A 4566.80 PASS AC=23;AF=0.575;AN=40;BaseQRankSum=-5.661;DP=409;Dels=0.00;FS=3.770;HaplotypeScore=0.2233;InbreedingCoeff
N00001 1219 . T C 1748.80 VQSRTrancheSNP90.00to99.00 AC=7;AF=0.175;AN=40;BaseQRankSum=-1.403;DP=452;Dels=0.00;FS=2.151;HaplotypeScore
N00001 1234 . T C 1449.56 VQSRTrancheSNP90.00to99.00 AC=6;AF=0.150;AN=40;BaseQRankSum=-0.988;DP=446;Dels=0.00;FS=2.264;HaplotypeScore
N00001 1258 . G T 1450.74 VQSRTrancheSNP90.00to99.00 AC=7;AF=0.175;AN=40;BaseQRankSum=-3.583;DP=419;Dels=0.00;FS=1.054;HaplotypeScore
N00001 1260 . C T 3836 PASS AC=14;AF=0.350;AN=40;BaseQRankSum=10.337;DP=415;Dels=0.00;FS=2.211;HaplotypeScore=3.6986;InbreedingCoeff
N00001 1357 . G A 3660.33 VQSRTrancheSNP90.00to99.00 AC=14;AF=0.350;AN=40;BaseQRankSum=0.859;DP=365;Dels=0.00;FS=13.926;HaplotypeScor

Question 2

I tried printing out ´var´ in the first loop and it was all right (and fast to get there) but it was still in the next line when I break it with ctrl+C after many hours.

Question 3

There were originally two problems with this question. I've edited and reopened it. 1) Golfing is off-topic (see help center). 2) Asking for an AWK script while presenting a Python solution is not asking for a code review; it's asking for code to be written, which if off-topic. (The [awk] tag was inappropriate.)

Question 4

Great, thank you. I often rock the boat on S.E sites without intending to.

Question 5

There are a number of issues with this program. I'll start with some trivial ones first.

Hard-coding the input filenames makes it inflexible throwaway code. I suggest taking command-line parameters (sys.argv).
Variable names tmp and var are horribly uninformative. Also, Lina violates PEP 8 capitalization guidelines.
PEP 8 also calls for four spaces of indentation. Since whitespace is significant in Python, this is a strong convention.
Casting to str() is superfluous.
Every line in the program is repeated in some form. Cut-and-paste programming is poor practice. Use functions to avoid such duplication.
You opened two files without closing them, which is a file descriptor leak.
Consider using the csv module.

The big issue, though, is that the list is not an appropriate data structure. Searching for an entry in an array is O(n); doing so in a nested loop would be O(n²).

What you want is a set. Finding an entry in a dictionary or a set is O(1). As a bonus, operations such as set difference are already implemented for you.

import csv
import sys
def summarize(filename):
 with open(filename) as f:
 reader = csv.reader(f, delimiter="\t")
 return set(field[0] + field[1] for field in reader if not field[0].startswith('#'))
# Two filename parameters expected on the command line
file1, file2 = sys.argv[1:]
comb1 = summarize(file1)
print 'Finished reading %s' % (file1)
comb2 = summarize(file2)
print 'Finished reading %s' % (file2)
print '%d distinct entries in %s' % (len(comb1), file1)
print '%d distinct entries in %s' % (len(comb2), file2)
print '%d of %s are not in %s' % (len(comb2 - comb1), file2, file1)
print '%d of %s are not in %s' % (len(comb1 - comb2), file1, file2)

Caveat: If the same entry (first two columns) reappears within the same file, this script only treats them as one distinct entry.

Question 6

Yes it was sloppy but set saved the day, set intersection looks like a good tool

Question 7

Just curious... how long does it take to run now?

Question 8

Just over 4 min:

13945665 distinct entries in iii, 12723411 distinct entries in jjj, 4702437 of jjj are not in iii, 5924691 of iii are not in JJJ, 8020974 entries are in both files

Question 9

For Python 2 it's better to open csv files using rb.

Question 10

@DSM Please explain why?

score 6 · Accepted Answer · 2014-09-02 09:45:34Z

There are a number of issues with this program. I'll start with some trivial ones first.

Hard-coding the input filenames makes it inflexible throwaway code. I suggest taking command-line parameters (sys.argv).
Variable names tmp and var are horribly uninformative. Also, Lina violates PEP 8 capitalization guidelines.
PEP 8 also calls for four spaces of indentation. Since whitespace is significant in Python, this is a strong convention.
Casting to str() is superfluous.
Every line in the program is repeated in some form. Cut-and-paste programming is poor practice. Use functions to avoid such duplication.
You opened two files without closing them, which is a file descriptor leak.
Consider using the csv module.

The big issue, though, is that the list is not an appropriate data structure. Searching for an entry in an array is O(n); doing so in a nested loop would be O(n²).

What you want is a set. Finding an entry in a dictionary or a set is O(1). As a bonus, operations such as set difference are already implemented for you.

import csv
import sys
def summarize(filename):
 with open(filename) as f:
 reader = csv.reader(f, delimiter="\t")
 return set(field[0] + field[1] for field in reader if not field[0].startswith('#'))
# Two filename parameters expected on the command line
file1, file2 = sys.argv[1:]
comb1 = summarize(file1)
print 'Finished reading %s' % (file1)
comb2 = summarize(file2)
print 'Finished reading %s' % (file2)
print '%d distinct entries in %s' % (len(comb1), file1)
print '%d distinct entries in %s' % (len(comb2), file2)
print '%d of %s are not in %s' % (len(comb2 - comb1), file2, file1)
print '%d of %s are not in %s' % (len(comb1 - comb2), file1, file2)

Caveat: If the same entry (first two columns) reappears within the same file, this script only treats them as one distinct entry.

Yes it was sloppy but set saved the day, set intersection looks like a good tool
Just over 4 min: 13945665 distinct entries in iii, 12723411 distinct entries in jjj, 4702437 of jjj are not in iii, 5924691 of iii are not in JJJ, 8020974 entries are in both files

Stack Exchange Network

Slow Python text-processing script

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Slow Python text-processing script

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions