
Any time that a row ID (oddly placed in column 8, i.e. row[7]) is repeated after the first instance, I want to write those rows into a second file. The code I'm using is extremely slow -- it's a 40-column CSV with about a million rows. This is what I have:

    import csv

    def in_out_gorbsplit(inf, outf1, outf2):
        outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
        outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
        inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
        inf1.next()  # skip the header row
        checklist = []  # IDs seen so far
        for row in inf1:
            id_num = str(row[7])
            if id_num not in checklist:
                # first occurrence of this ID
                outf1.writerow(row)
                checklist.append(id_num)
            else:
                # repeated ID
                outf2.writerow(row)

asked Nov 30, 2014 at 4:49 by Xodarap777 (edited Nov 30, 2014 at 4:58)

1 Answer

Since checklist is a list, a "not in" operation has to iterate over all elements to give the correct answer. In other words, each check has a complexity of \$O(n)\$, so the whole loop can approach \$O(n^2)\$ on a million rows. Use a set() instead, to lower the complexity of each check to \$O(1)\$ on average, making it much faster.

Also don't forget to close open file handles.
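
As a rough illustration of both points, here is a minimal sketch that tracks seen IDs in a set and lets with-blocks close the files. It is not the asker's code: it assumes Python 3 (text mode with newline='' instead of 'wb'/'rbU'), and the function name split_repeated_ids and the key_col parameter are made up for this example.

```python
import csv


def split_repeated_ids(in_path, first_path, repeat_path, key_col=7):
    """Write the first row seen for each ID to first_path, repeats to repeat_path.

    Hypothetical helper for illustration; key_col=7 matches row[7] in the question.
    """
    seen = set()  # set membership is O(1) on average, unlike scanning a list
    # The with statement closes all three files, even if an error occurs.
    with open(in_path, newline='') as inf, \
            open(first_path, 'w', newline='') as outf1, \
            open(repeat_path, 'w', newline='') as outf2:
        reader = csv.reader(inf)
        first_writer = csv.writer(outf1, lineterminator='\n')
        repeat_writer = csv.writer(outf2, lineterminator='\n')
        next(reader, None)  # skip the header row, as the original code does
        for row in reader:
            id_num = row[key_col]
            if id_num in seen:
                repeat_writer.writerow(row)
            else:
                seen.add(id_num)
                first_writer.writerow(row)
```

Like the original, this drops the header row from both output files; the only intended behavioral change is the faster membership test and the guaranteed file cleanup.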

answered Nov 30, 2014 at 8:44 by janos (edited Nov 30, 2014 at 10:47)

Comment: This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off!
– Xodarap777, Nov 30, 2014 at 10:13

Split CSV by Repeated cells python
