Whenever a row ID (oddly placed in column 8, i.e. row[7]) appears again after its first occurrence, I want to write the repeated rows to a second file. The code I'm using is extremely slow on my input, a 40-column CSV with about a million rows. This is what I have:
def in_out_gorbsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()
    checklist = []
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            outf2.writerow(row)
1 Answer
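Since checklist is a list, a "not in" test has to iterate over its elements to give the correct answer. In other words, each test has a complexity of \$O(n)\$, and because the test runs once per row, the loop as a whole is \$O(n^2)\$ in the worst case. Use a set() instead (with checklist.add(id_num) in place of append) to lower the complexity of each test to \$O(1)\$ on average, making it much faster.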
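To make the difference concrete, here is a small timing sketch. It assumes Python 3, and the names (ids, as_list, as_set, probe) and the sizes are illustrative choices, not taken from the question. The measured times will vary by machine, so no specific figures are claimed.

```python
import timeit

# 20,000 string IDs, stored once as a list and once as a set.
ids = [str(i) for i in range(20000)]
as_list = list(ids)
as_set = set(ids)

# Probing for the last ID forces the list to be scanned end to end.
probe = "19999"

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

Both containers give the same answer; only the cost of getting it differs, and the gap widens as the number of stored IDs grows.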
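Also, don't forget to close the open file handles. The code never closes the three files it opens and relies on garbage collection to flush and close them; a with statement closes them deterministically.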
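Putting both suggestions together, a minimal sketch of the function might look like the following. It assumes Python 3 (the question's code is Python 2, hence the different open() modes and next(reader) instead of reader.next()); the function name split_repeated_ids and the id_col parameter are hypothetical names introduced here for illustration.

```python
import csv


def split_repeated_ids(in_path, first_path, repeat_path, id_col=7):
    """Write first occurrences of each ID to first_path, repeats to repeat_path."""
    seen = set()  # set membership tests are O(1) on average
    # The with statement closes all three files, even if an error occurs.
    with open(in_path, newline='') as inf, \
            open(first_path, 'w', newline='') as out1, \
            open(repeat_path, 'w', newline='') as out2:
        reader = csv.reader(inf)
        first_writer = csv.writer(out1, lineterminator='\n')
        repeat_writer = csv.writer(out2, lineterminator='\n')
        next(reader, None)  # skip the header row, as the original code does
        for row in reader:
            id_num = row[id_col]
            if id_num in seen:
                repeat_writer.writerow(row)
            else:
                seen.add(id_num)
                first_writer.writerow(row)
```

It would be called as split_repeated_ids('input.csv', 'first.csv', 'repeats.csv'), where the file names are placeholders. The set holds one string per distinct ID, so memory use grows with the number of distinct IDs rather than with the number of rows.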
This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off! – Xodarap777, Nov 30, 2014 at 10:13