\$\begingroup\$

Whenever a row ID (oddly placed in column 8, i.e. row[7]) appears again after its first occurrence, I want to write the repeated row into a second file. The code I'm using is extremely slow on my input, a 40-column CSV with about a million rows. This is what I have:

import csv

def in_out_gorbsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # discard the first row of the input
    checklist = []  # IDs seen so far
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            # first occurrence of this ID
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            # repeated ID
            outf2.writerow(row)
asked Nov 30, 2014 at 4:49
\$\endgroup\$

1 Answer

\$\begingroup\$

Since checklist is a list, a "not in" test may have to scan every element before it can give the correct answer. In other words, each test has a complexity of \$O(n)\$, and running it once per row makes the whole loop \$O(n^2)\$ in the worst case. Use a set() instead to lower the cost of each membership test to \$O(1)\$ on average, making the loop much faster.

Also don't forget to close open file handles.
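As a rough sketch of both suggestions together, the version below keeps the seen IDs in a set and opens the files in a with-statement so they are closed automatically. It assumes Python 3 (text mode with newline='' instead of 'wb'/'rbU', and next(reader) instead of reader.next()); the function and parameter names are made up for illustration and are not from the question.

```python
import csv


def split_repeated_ids(in_path, first_path, repeat_path, id_col=7):
    """Write the first row seen for each ID to first_path and
    every later row with an already-seen ID to repeat_path.

    Hypothetical helper: names and the id_col default are assumptions
    based on the question's use of row[7].
    """
    seen = set()  # set membership is O(1) on average, unlike a list
    # The with-statement closes all three files, even if an error occurs.
    with open(in_path, newline='') as src, \
            open(first_path, 'w', newline='') as first_out, \
            open(repeat_path, 'w', newline='') as repeat_out:
        reader = csv.reader(src)
        first_writer = csv.writer(first_out, lineterminator='\n')
        repeat_writer = csv.writer(repeat_out, lineterminator='\n')
        next(reader, None)  # discard the first row, as the question's code does
        for row in reader:
            id_num = row[id_col]
            if id_num in seen:
                repeat_writer.writerow(row)
            else:
                seen.add(id_num)
                first_writer.writerow(row)
```

The logic is otherwise the same as in the question: only the container type and the file handling change.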

answered Nov 30, 2014 at 8:44
\$\endgroup\$
  • This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off! Commented Nov 30, 2014 at 10:13
