I have a CSV with 5+ million rows and I want to filter it:
    with open("test1.csv") as csv_file:
        writer = csv.writer(outputfile, delimiter=',')
        for row in csv.reader(csv_file, delimiter=','):
            row = [int(x) for x in row]
            for i in range(1, 11):
                if add_i(row) == True:
                    break
                elif row not in row_dup:
                    row_dup.append(row)
                    writer.writerow(row)
I have many functions, the simplest one being `add_i`, which is called in the snippet above. Its body is:

    if row[0:3] == range(row[0], row[0] + 3 * i, i):
        return True
    row_dup = []

is an empty list I use to store the rows already written, so that duplicates don't get written to the file.
I want to improve the execution time of my script, but it takes hours. According to a friend's calculation it would take 530 hours or so; I'm not sure about that figure, but regardless, it takes a very long time.
How I tried to improve the speed of the script:

- Using PyPy, but I didn't notice a significant improvement.
- Removing `row = [int(x) for x in row]`, but then I'd have to write

      if [int(x) for x in row[0:4]] == range(int(row[0]), int(row[0]) + 4 * i, i):

  which is annoying, and there's still a `for` loop.
- Removing `for i in range(1, 11)`, but then I would have to write the indexes myself.
- Removing both `for i in range(1, 11)` and `row = [int(x) for x in row]` and doing it like so:

      if (int(row[0]) + 1 == int(row[1])) and int(row[1]) + 1 == int(row[2]) and int(row[2]) + 1 == int(row[3]) and int(row[3]) + 1 == int(row[4]): do something

  This is a desperate plan and the most annoying one: I get rid of two `for` loops and do everything by hand, copy-pasting the block with the step value increased from 1 to 11.
- Considered using Hadoop, but the VPSes I have are low-end boxes, much less powerful than my PC.
- Considered using SQLite instead of CSV, but I'm not convinced it would make things much faster.
- Considered writing it in C++; in fact the first version was in C++, but I moved to Python since I didn't notice a big difference and because it's easier to plot in Python.
And by the way, I'm on Manjaro 64-bit with an Ext4 filesystem, running Python 2.7 in a virtualenv, and the only module I'm using is `csv`. The machine is an HP Pavilion G6 laptop, if you want to look up its specs.
Note that the code takes that much time even though I'm only testing with `add_i`; I didn't even bother testing the more complicated functions. Who knows how much time it would take then.
- "If you have a CSV file with five million or more rows, it may be time to upgrade to a real database." – Ethan Bierlein, Jul 1, 2015 at 23:50
- "@EthanBierlein Would it improve the execution time? The reason I don't use a database is that I'm the only one using the data, I keep a backup of the CSV, and it's only on my PC, no websites." – Lynob, Jul 1, 2015 at 23:52
- "@Lynob Yes. A database using a query language like SQL would be much faster, simply because it's highly optimized. You can always just keep your data on your localhost as well." – Ethan Bierlein, Jul 1, 2015 at 23:53
- "@Lynob I don't see much else that can really be improved in this little snippet." – Ethan Bierlein, Jul 1, 2015 at 23:55
- "@Quill True, but I don't think it could be done in less than 3 days or so." – Lynob, Jul 2, 2015 at 0:00
1 Answer
> This is a desperate plan and the most annoying. I get rid of two for loops and do everything by hand.

I will take this question to be "How can I remove the two inner for loops, if possible?".

First off, you changed the row slice from 'the first three' to 'the first four' partway through your question. So post all your code; don't make us guess. Anyway, this is what I think you want to give us:
    import csv

    def add_i(row, i):
        if row[0:3] == range(row[0], row[0] + 3 * i, i):
            return True

    row_dup = []
    with open("out_file", "w") as out_file:
        writer = csv.writer(out_file, delimiter=',')
        with open("test1.csv") as csv_file:
            for row in csv.reader(csv_file, delimiter=','):
                row = [int(x) for x in row]
                for i in range(1, 11):
                    if add_i(row, i) == True:
                        break
                    elif row not in row_dup:
                        row_dup.append(row)
                        writer.writerow(row)
First, we don't want this to be the worst version of `cp` ever written, so we will use the `for`/`else` statement.
    for i in range(1, 11):
        if add_i(row, i):
            break
    else:
        if row not in row_dup:
            # ...
This makes it so that `[2, 4, 6]` gets removed as well. As written you have a bug: on the `i = 1` iteration, `add_i` compares the row against `range(2, 5, 1)`, returns `False`, and the row gets added to the output before the `i = 2` iteration has a chance to match it. With `for`/`else`, the row is only written if no step value matched.
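To make the difference concrete, here is a minimal sketch of both versions run on `[2, 4, 6]` (the row value is just an illustration, `add_i` is inlined from the code above, and it relies on Python 2, where `range` returns a list):

    def add_i(row, i):
        # True when the first three values form an arithmetic
        # progression with common difference i
        return row[0:3] == range(row[0], row[0] + 3 * i, i)

    row = [2, 4, 6]
    row_dup = []

    # Old logic: add_i fails for i = 1, so the row is written
    # before i = 2 ever gets a chance to match it.
    for i in range(1, 11):
        if add_i(row, i):
            break
        elif row not in row_dup:
            row_dup.append(row)
            print('old logic writes %s' % row)  # prints

    # New logic: the else clause only runs if no step matched.
    for i in range(1, 11):
        if add_i(row, i):
            break
    else:
        if row not in row_dup:
            print('new logic writes %s' % row)  # never reached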
Second, we can turn the `for` loop into just an `if` statement. This works because you are comparing against `range(start, stop, step)`; I will use `i`, `j` and `k` for start, stop and step respectively. `range` returns a list of the form `[i, i+k, i+2k, ..., i + ((j-i)//k)*k]`. Since you call `range(row[0], row[0] + 3*i, i)`, we know the list is `[i, i+k, i+2k]`, where `i` and `k` are `row[0]` and the loop variable `i` respectively.
As this is getting rather complicated, here is how I would implement it:

    def add_i(row, step_range):
        step = row[1] - row[0]
        if step_range[0] <= step < step_range[1]:
            return step == (row[2] - row[1])
        return False
Rather than comparing lists, this applies the underlying logic directly: the difference between consecutive values has to equal `step`, and `step` has to lie within the allowed range.
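As a quick sanity check (these calls are illustrative; `(1, 11)` mirrors the bounds of the original `range(1, 11)` loop):

    print(add_i([2, 4, 6], (1, 11)))    # True: common difference 2
    print(add_i([2, 4, 7], (1, 11)))    # False: differences are 2 and 3
    print(add_i([5, 25, 45], (1, 11)))  # False: step 20 is out of bounds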
Now that we have gotten rid of one of the two inner for loops, we need to get rid of the other for this to be what I would deem the 'correct' answer.

The remaining loop is the list comprehension `row = [int(x) for x in row]`. If all your rows had only 3 columns, it would be a relatively fast and really easy conversion, but I am guessing that is not the case. So if we just remove it, we have to convert the columns into numbers manually. We can do this in `add_i` again:
    def add_i(row, step_range):
        r1 = int(row[1])
        step = r1 - int(row[0])
        if step_range[0] <= step < step_range[1]:
            return step == (int(row[2]) - r1)
        return False
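Since the conversion now happens inside `add_i`, you can feed it the raw string values straight from `csv.reader` (an illustrative call):

    print(add_i(['5', '7', '9'], (1, 11)))  # True, with no per-row int() pass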
Edit: I thought I should benchmark sets, and found they were very fast.

    row_dup = set()

    # ... in the for loop
    row_ = ','.join(row)  # join on the delimiter so columns can't merge
    if not add_i(row, (1, 11)) and row_ not in row_dup:
        row_dup.add(row_)
This allows for very fast look-ups, and only has a ~27% impact on speed. The way we make it possible to use a set is by making the data hashable: strings are hashable, and with this change the data in `row` stays all strings, so you can simply `str.join` it together.
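A rough way to see why the set wins (a minimal sketch, with arbitrary sizes):

    import timeit

    setup = (
        "keys = [str(i) for i in range(10000)]\n"
        "as_list = list(keys)\n"
        "as_set = set(keys)"
    )
    # Membership is an O(n) scan for the list
    # but an O(1) hash lookup for the set.
    print(timeit.timeit("'9999' in as_list", setup=setup, number=1000))
    print(timeit.timeit("'9999' in as_set", setup=setup, number=1000))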
And so we have an answer with only one `for` loop:
    import csv

    def add_i(row, step_range):
        r1 = int(row[1])
        step = r1 - int(row[0])
        if step_range[0] <= step < step_range[1]:
            return step == (int(row[2]) - r1)
        return False

    row_dup = set()
    with open("out_file", "w") as out_file:
        writer = csv.writer(out_file, delimiter=',')
        with open("test1.csv") as csv_file:
            for row in csv.reader(csv_file, delimiter=','):
                row_ = ','.join(row)
                if not add_i(row, (1, 11)) and row_ not in row_dup:
                    row_dup.add(row_)
                    writer.writerow(row)
This is roughly O(n). And you can add a list comprehension to extend the check to more columns easily:
    def add_i(row, step_range):
        r0 = int(row[0])
        step = int(row[1]) - r0
        if step_range[0] <= step < step_range[1]:
            # build the expected progression as strings and compare
            return row[1:3] == [str(i) for i in range(r0 + step, r0 + 3*step, step)]
        return False
As you seem to want to look over more than just 3 columns, judging by your example with four, a list comprehension is probably how you would want to do this.
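For instance, a sketch of how it might generalize to a configurable number of columns (the `columns` parameter is my addition for illustration, not part of the original code):

    def add_i(row, step_range, columns=3):
        # row values arrive as strings from csv.reader
        r0 = int(row[0])
        step = int(row[1]) - r0
        if step_range[0] <= step < step_range[1]:
            expected = [str(r0 + n * step) for n in range(1, columns)]
            return row[1:columns] == expected
        return False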
If you expand on this, then you would want to change `add_i` to take only a row, and call it something like `format_test`. This way you are testing whether the row follows a format, and if it does not, then you add it, if not already added, to the new file. You can also now tell better what the main part of the program is doing, as it reads as 'if this row passes this test, add it to the out file.'
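A minimal sketch of what that rename might look like (the step bounds are hard-coded here purely for illustration):

    def format_test(row):
        # True when the first three columns form an arithmetic
        # progression with a step between 1 and 10
        r1 = int(row[1])
        step = r1 - int(row[0])
        if 1 <= step <= 10:
            return step == (int(row[2]) - r1)
        return False

    # ... the main loop then reads:
    # if not format_test(row) and row_ not in row_dup:
    #     ...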
Also, if you want some benchmarks: I used 100,000 random 3-column CSV lines of numbers up to 10,000.
New program with `row_dup` as type `list`:

    real    3m37.109s
    user    3m36.857s
    sys     0m0.057s

New program with `row_dup` as type `set`:

    real    0m0.453s
    user    0m0.440s
    sys     0m0.010s

New program without `row_dup`:

    real    0m0.357s
    user    0m0.350s
    sys     0m0.000s

Old program without `row_dup`:

    real    0m1.321s
    user    0m1.290s
    sys     0m0.010s
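A generator for input of that shape might look like this (a sketch matching the description above, not necessarily the exact script used for these timings):

    import csv
    import random

    # assumed shape: 100000 rows, 3 columns, values up to 10000
    with open("test1.csv", "w") as f:
        writer = csv.writer(f, delimiter=',')
        for _ in range(100000):
            writer.writerow([random.randint(0, 10000) for _ in range(3)])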
- "Perhaps I'll remove it, let duplicates be written, then remove them using `sort file.csv | uniq`. Perhaps that's my best bet." – Lynob, Jul 2, 2015 at 20:47