I have a CSV with around 10,000,000 lines. Each line looks like:
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
This is the script, which does a frequency count and outputs the results to another text file:
from collections import Counter
import csv
import bz2
import sys

csv.field_size_limit(999999999)

all_terms = []

#with open('keywords.csv', 'rb') as f:
with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        for term in row:
            terms = term.split(",")
            for x in terms:
                all_terms.append( x.strip().lower() )
        if idx % 100000 == 0:
            print idx

c = Counter(all_terms)

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write( "{} | {}\n".format(k, v) )
I can't think of a way to minimize the memory utilization using generators, and I'm not sure I could do a better job at the frequency counting than the Counter class from the collections module.
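For illustration, a minimal sketch of the generator idea (not the script under review) might look like this; it assumes the csv reader's own field splitting is sufficient, so the extra split(",") is unnecessary:

from collections import Counter
from itertools import chain
import bz2
import csv

csv.field_size_limit(999999999)

with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    # chain.from_iterable flattens the rows lazily, so the Counter
    # consumes one term at a time instead of a 10M-element list.
    c = Counter(term.strip().lower()
                for term in chain.from_iterable(reader))

with open("counts.txt", "w") as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))

The Counter still has to hold every distinct term, but the intermediate all_terms list disappears.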
- As you wrote it, the lines contain the same keywords and they're also of the same length. Is that the case? – Grajdeanu Alex, Nov 29, 2016 at 7:10
- Hi @Dex'ter - no, they may all be variable length and each line will have some variable number of keywords. – jrjames83, Nov 29, 2016 at 15:14
- I have rolled back Rev 5 → 3. Please see What to do when someone answers. I suggest that you either write a comment on @Graipher's answer or post a follow-up question. – 200_success, Nov 30, 2016 at 3:37
1 Answer
The obvious issue with this is that you first build the list of all keywords and then count them. It would make more sense to count them while you process each line. collections.Counter supports this with an overridden update method:
>>> c = Counter([1, 2, 3])
>>> c
Counter({1: 1, 2: 1, 3: 1})
>>> c.update([1, 4])
>>> c
Counter({1: 2, 2: 1, 3: 1, 4: 1})
Which you can use in your case like this:
from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()

with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        c.update(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))
I don't see the reason for the extra term.split(","); the CSV reader already splits each row into a list of fields, so I left that out and moved the remaining loop into the generator expression.
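As a quick illustration (a small REPL check added here, not from the original post), the reader already hands each line back as a list of fields, and strip() takes care of the leading spaces:

>>> import csv
>>> list(csv.reader(["keyword1, keyword2, keyword3"]))
[['keyword1', ' keyword2', ' keyword3']]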
You should test the performance of
c.update(term.strip().lower() for term in row)
vs
c.update(map(str.lower, map(str.strip, row)))
vs
c.update(map(lambda term: term.strip().lower(), row))
with your file, because I'm not sure how well the double map performs in Python 2.x compared to a lambda or a generator expression.
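For example, a rough way to compare the three variants in isolation could be a timeit micro-benchmark like the sketch below (the 20-field sample row is made up for illustration):

import timeit

setup = ("from collections import Counter; "
         "c = Counter(); "
         "row = [' Keyword%d ' % i for i in range(20)]")

# Each statement processes the same made-up row 100,000 times.
print timeit.timeit("c.update(term.strip().lower() for term in row)",
                    setup=setup, number=100000)
print timeit.timeit("c.update(map(str.lower, map(str.strip, row)))",
                    setup=setup, number=100000)
print timeit.timeit("c.update(map(lambda term: term.strip().lower(), row))",
                    setup=setup, number=100000)

The counter keeps accumulating across repetitions, but since every variant sees the same row the comparison should still be fair.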
As an additional aside, it might make sense not to do the update on every iteration, but to buffer the terms for some number of rows. You can reuse the existing progress print as the point where the buffer is flushed into the counter. With this code:
from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()
buf = []

with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        buf.extend(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx
            c.update(buf)
            buf = []

c.update(buf)

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))
I get some 25% speed increase on my local test file (which has a lot of repeated lines). At this point it becomes a trade-off between the cost of update and extend. In the first version, update determines the runtime:
29877245 function calls (29877242 primitive calls) in 8.559 seconds
...
999181 2.756 0.000 7.515 0.000 collections.py:528(update)
For the second version the time is shared between update and extend, and the total is lower:
23882242 function calls (23882239 primitive calls) in 5.870 seconds
...
12 1.421 0.118 1.791 0.149 collections.py:528(update)
...
999180 0.779 0.000 3.065 0.000 {method 'extend' of 'list' objects}
I also tried basically your approach, updating the counter only at the end. This is about another 0.2 seconds faster at the cost of potentially unlimited memory usage.
- Brilliant - I had a feeling something like this would work but I could not conceptualize it. Let me test it out and report back. – jrjames83, Nov 29, 2016 at 15:19
- Graipher, what are you using to profile your code? – jrjames83, Nov 29, 2016 at 17:24
- @jrjames83 Just the built-in profiler: python -m cProfile path/to/script.py – Graipher, Nov 29, 2016 at 17:26
- I don't know how you're supposed to use this site :). In any event, I tried sending the data to the Counter class incrementally, and when running the profiler it doubled the time spent dealing with the class. It seems that as long as RAM is not an issue (the RAM utilization peaks at around 8 GB when I run it), it's faster to dump everything into the Counter from memory once the parsing is completed. – jrjames83, Nov 30, 2016 at 4:00