I have a CSV with 5+ million rows and I want to filter it:
    with open("test1.csv") as csv_file:
        writer = csv.writer(outputfile, delimiter=',')
        for row in csv.reader(csv_file, delimiter=','):
            row = [int(x) for x in row]
            for i in range(1, 11):
                if add_i(row) == True:
                    break
                elif row not in row_dup:
                    row_dup.append(row)
                    writer.writerow(row)
I have many functions, the simplest one being `add_i`, which is called in the snippet above. Its body is:

    if row[0:3] == range(row[0], row[0] + 3 * i, i):
        return True
    row_dup = []

is an empty list I use to store the rows already written, so that duplicates don't get written to the file.
I want to improve the execution time of my script, but it takes hours. According to a friend's calculation it would take 530 hours or so; I'm not sure about that figure, but regardless, it takes a very long time.
How I tried to improve the speed of the script:

- Using PyPy, but I didn't notice a significant improvement.
- Removing `row = [int(x) for x in row]`, but then I'd have to write

      if [int(x) for x in row[0:4]] == range(int(row[0]), int(row[0]) + 4 * i, i):

  which is annoying, and there's still a `for` loop.
- Removing `for i in range(1, 11)`, but then I would have to write the indexes myself.
- Removing both `for i in range(1, 11)` and `row = [int(x) for x in row]` and doing it like so:

      if (int(row[0]) + 1 == int(row[1])) and int(row[1]) + 1 == int(row[2]) and int(row[2]) + 1 == int(row[3]) and int(row[3]) + 1 == int(row[4]): do something

  This is a desperate plan and the most annoying one: I get rid of two `for` loops and do everything by hand, copy-pasting the block with the step value increased from 1 to 11.
- Considered using Hadoop, but the VPSes I have are low-end boxes, much less powerful than my PC.
- Considered using SQLite instead of CSV, but I'm not convinced it would make things much faster.
- Considered writing it in C++; in fact the first version was in C++, but I moved to Python since I didn't notice a big difference and because it's easier to plot in Python.
And by the way, I'm on Manjaro 64-bit with an Ext4 filesystem, running Python 2.7 in a virtualenv, and the only module I'm using is `csv`. The machine is an HP Pavilion G6 laptop, if you want to look up its specs.
Note that the code takes that much time even though I'm only testing with `add_i`; I didn't even bother testing the more complicated functions. Who knows how much time it would take then.
- "If you have a CSV file with five million or more rows, it may be time to upgrade to a real database." – Ethan Bierlein, Jul 1, 2015 at 23:50
- "@EthanBierlein Would it improve the execution time? The reason I don't use a database is that I'm the only one using the data, I keep a backup of the CSV, and it's only on my PC, no websites." – Lynob, Jul 1, 2015 at 23:52
- "@Lynob Yes. A database using a query language like SQL would be much faster, simply because it's highly optimized. You can always just keep your data on your localhost as well." – Ethan Bierlein, Jul 1, 2015 at 23:53
- "@Lynob I don't see much else that can really be improved in this little snippet." – Ethan Bierlein, Jul 1, 2015 at 23:55
- "@Quill True, but I don't think it could be done in less than 3 days or so." – Lynob, Jul 2, 2015 at 0:00
1 Answer
> This is a desperate plan and the most annoying. I get rid of two for loops and do everything by hand.

I will take this question to be "How can I remove the two inner for loops, if possible?".

First off, you changed the row slice from 'the first three' to 'the first four' partway through your question. So post all your code; don't make us guess. Anyway, this is what I think you want to give us:
    import csv

    def add_i(row, i):
        if row[0:3] == range(row[0], row[0] + 3 * i, i):
            return True

    row_dup = []
    with open("out_file", "w") as out_file:
        writer = csv.writer(out_file, delimiter=',')
        with open("test1.csv") as csv_file:
            for row in csv.reader(csv_file, delimiter=','):
                row = [int(x) for x in row]
                for i in range(1, 11):
                    if add_i(row, i) == True:
                        break
                    elif row not in row_dup:
                        row_dup.append(row)
                        writer.writerow(row)
First, we don't want this to be the worst version of `cp` ever written, so we will use the `for`/`else` statement.
    for i in range(1, 11):
        if add_i(row, i):
            break
    else:
        if row not in row_dup:
            # ...
This makes it so that `[2, 4, 6]` gets removed as well. As written you have a bug: on the `i = 1` iteration, `add_i` compares the row against `range(2, 5, 1)`, returns `False`, and the row gets added to the output before the `i = 2` iteration has a chance to match it. With `for`/`else`, the row is only written if no step value matched.
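To make the difference concrete, here is a minimal sketch of both versions run on `[2, 4, 6]` (the row value is just an illustration, `add_i` is inlined from the code above, and it relies on Python 2, where `range` returns a list):

    def add_i(row, i):
        # True when the first three values form an arithmetic
        # progression with common difference i
        return row[0:3] == range(row[0], row[0] + 3 * i, i)

    row = [2, 4, 6]
    row_dup = []

    # Old logic: add_i fails for i = 1, so the row is written
    # before i = 2 ever gets a chance to match it.
    for i in range(1, 11):
        if add_i(row, i):
            break
        elif row not in row_dup:
            row_dup.append(row)
            print('old logic writes %s' % row)  # prints

    # New logic: the else clause only runs if no step matched.
    for i in range(1, 11):
        if add_i(row, i):
            break
    else:
        if row not in row_dup:
            print('new logic writes %s' % row)  # never reached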
Second, we can turn the `for` loop into just an `if` statement. This works because you are comparing against `range(start, stop, step)`; I will use `i`, `j` and `k` for start, stop and step respectively. `range` returns a list of the form `[i, i+k, i+2k, ..., i + ((j-i)//k)*k]`. Since you call `range(row[0], row[0] + 3*i, i)`, we know the list is `[i, i+k, i+2k]`, where `i` and `k` are `row[0]` and the loop variable `i` respectively.
As this is getting rather complicated, here is how I would implement it:

    def add_i(row, step_range):
        step = row[1] - row[0]
        if step_range[0] <= step < step_range[1]:
            return step == (row[2] - row[1])
        return False
Rather than comparing lists, this applies the underlying logic directly: the difference between consecutive values has to equal `step`, and `step` has to lie within the allowed range.
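As a quick sanity check (these calls are illustrative; `(1, 11)` mirrors the bounds of the original `range(1, 11)` loop):

    print(add_i([2, 4, 6], (1, 11)))    # True: common difference 2
    print(add_i([2, 4, 7], (1, 11)))    # False: differences are 2 and 3
    print(add_i([5, 25, 45], (1, 11)))  # False: step 20 is out of bounds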
Now that we have gotten rid of one of the two inner for loops, we need to get rid of the other for this to be what I would deem the 'correct' answer.

The remaining loop is the list comprehension `row = [int(x) for x in row]`. If all your rows had only 3 columns, it would be a relatively fast and really easy conversion, but I am guessing that is not the case. So if we just remove it, we have to convert the columns into numbers manually. We can do this in `add_i` again:
    def add_i(row, step_range):
        r1 = int(row[1])
        step = r1 - int(row[0])
        if step_range[0] <= step < step_range[1]:
            return step == (int(row[2]) - r1)
        return False
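Since the conversion now happens inside `add_i`, you can feed it the raw string values straight from `csv.reader` (an illustrative call):

    print(add_i(['5', '7', '9'], (1, 11)))  # True, with no per-row int() pass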
Edit: I thought I should benchmark sets, and found they were very fast.

    row_dup = set()

    # ... in the for loop
    row_ = ','.join(row)  # join on the delimiter so columns can't merge
    if not add_i(row, (1, 11)) and row_ not in row_dup:
        row_dup.add(row_)
This allows for very fast look-ups, and only has a ~27% impact on speed. The way we make it possible to use a set is by making the data hashable: strings are hashable, and with this change the data in `row` stays all strings, so you can simply `str.join` it together.
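A rough way to see why the set wins (a minimal sketch, with arbitrary sizes):

    import timeit

    setup = (
        "keys = [str(i) for i in range(10000)]\n"
        "as_list = list(keys)\n"
        "as_set = set(keys)"
    )
    # Membership is an O(n) scan for the list
    # but an O(1) hash lookup for the set.
    print(timeit.timeit("'9999' in as_list", setup=setup, number=1000))
    print(timeit.timeit("'9999' in as_set", setup=setup, number=1000))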
And so we have an answer with only one `for` loop:
    import csv

    def add_i(row, step_range):
        r1 = int(row[1])
        step = r1 - int(row[0])
        if step_range[0] <= step < step_range[1]:
            return step == (int(row[2]) - r1)
        return False

    row_dup = set()
    with open("out_file", "w") as out_file:
        writer = csv.writer(out_file, delimiter=',')
        with open("test1.csv") as csv_file:
            for row in csv.reader(csv_file, delimiter=','):
                row_ = ','.join(row)
                if not add_i(row, (1, 11)) and row_ not in row_dup:
                    row_dup.add(row_)
                    writer.writerow(row)
This is roughly O(n). And you can add a list comprehension to extend the check to more columns easily:
    def add_i(row, step_range):
        r0 = int(row[0])
        step = int(row[1]) - r0
        if step_range[0] <= step < step_range[1]:
            # build the expected progression as strings and compare
            return row[1:3] == [str(i) for i in range(r0 + step, r0 + 3*step, step)]
        return False
As you seem to want to look over more than just 3 columns, judging by your example with four, a list comprehension is probably how you would want to do this.
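For instance, a sketch of how it might generalize to a configurable number of columns (the `columns` parameter is my addition for illustration, not part of the original code):

    def add_i(row, step_range, columns=3):
        # row values arrive as strings from csv.reader
        r0 = int(row[0])
        step = int(row[1]) - r0
        if step_range[0] <= step < step_range[1]:
            expected = [str(r0 + n * step) for n in range(1, columns)]
            return row[1:columns] == expected
        return False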
If you expand on this, then you would want to change `add_i` to take only a row, and call it something like `format_test`. This way you are testing whether the row follows a format, and if it does not, then you add it, if not already added, to the new file. You can also now tell better what the main part of the program is doing, as it reads as 'if this row passes this test, add it to the out file.'
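A minimal sketch of what that rename might look like (the step bounds are hard-coded here purely for illustration):

    def format_test(row):
        # True when the first three columns form an arithmetic
        # progression with a step between 1 and 10
        r1 = int(row[1])
        step = r1 - int(row[0])
        if 1 <= step <= 10:
            return step == (int(row[2]) - r1)
        return False

    # ... the main loop then reads:
    # if not format_test(row) and row_ not in row_dup:
    #     ...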
Also, if you want some benchmarks: I used 100,000 random 3-column CSV lines of numbers up to 10,000.
New program with `row_dup` as type `list`:

    real    3m37.109s
    user    3m36.857s
    sys     0m0.057s

New program with `row_dup` as type `set`:

    real    0m0.453s
    user    0m0.440s
    sys     0m0.010s

New program without `row_dup`:

    real    0m0.357s
    user    0m0.350s
    sys     0m0.000s

Old program without `row_dup`:

    real    0m1.321s
    user    0m1.290s
    sys     0m0.010s
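A generator for input of that shape might look like this (a sketch matching the description above, not necessarily the exact script used for these timings):

    import csv
    import random

    # assumed shape: 100000 rows, 3 columns, values up to 10000
    with open("test1.csv", "w") as f:
        writer = csv.writer(f, delimiter=',')
        for _ in range(100000):
            writer.writerow([random.randint(0, 10000) for _ in range(3)])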
- "Perhaps I'll remove it, let duplicates be written, then remove them using `sort file.csv | uniq`. Perhaps that's my best bet." – Lynob, Jul 2, 2015 at 20:47