I have a CSV with around 10,000,000 lines. Each line looks like:
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
This is the script, which does a frequency count and outputs the results to another text file:
from collections import Counter
import csv
import bz2
import sys

csv.field_size_limit(999999999)

all_terms = []

#with open('keywords.csv', 'rb') as f:
with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        for term in row:
            terms = term.split(",")
            for x in terms:
                all_terms.append( x.strip().lower() )
        if idx % 100000 == 0:
            print idx

c = Counter(all_terms)

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write( "{} | {}\n".format(k, v) )
I can't think of a way to minimize the memory utilization using generators, and I'm not sure I could do a better job at the frequency counting than the Counter class from the collections module.
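For illustration, a minimal sketch of the generator idea (not the script under review) might look like this; it assumes the csv reader's own field splitting is sufficient, so the extra split(",") is unnecessary:

from collections import Counter
from itertools import chain
import bz2
import csv

csv.field_size_limit(999999999)

with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    # chain.from_iterable flattens the rows lazily, so the Counter
    # consumes one term at a time instead of a 10M-element list.
    c = Counter(term.strip().lower()
                for term in chain.from_iterable(reader))

with open("counts.txt", "w") as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))

The Counter still has to hold every distinct term, but the intermediate all_terms list disappears.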
- As you wrote it, the lines contain the same keywords and they're also of the same length. Is that the case? – Grajdeanu Alex, Nov 29, 2016 at 7:10
- Hi @Dex'ter - no, they may all be variable length and each line will have some variable number of keywords. – jrjames83, Nov 29, 2016 at 15:14
- I have rolled back Rev 5 → 3. Please see What to do when someone answers. I suggest that you either write a comment on @Graipher's answer or post a follow-up question. – 200_success, Nov 30, 2016 at 3:37
1 Answer
The obvious issue with this is that you first build the list of all keywords and then count them. It would make more sense to count them while you process each line. collections.Counter supports this with an overridden update method:
>>> c = Counter([1, 2, 3])
>>> c
Counter({1: 1, 2: 1, 3: 1})
>>> c.update([1, 4])
>>> c
Counter({1: 2, 2: 1, 3: 1, 4: 1})
Which you can use in your case like this:
from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()

with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        c.update(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))
I don't see the reason for the extra term.split(","); the CSV reader already splits each row into a list of fields, so I left that out and moved the remaining loop into the generator expression.
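As a quick illustration (a small REPL check added here, not from the original post), the reader already hands each line back as a list of fields, and strip() takes care of the leading spaces:

>>> import csv
>>> list(csv.reader(["keyword1, keyword2, keyword3"]))
[['keyword1', ' keyword2', ' keyword3']]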
You should test the performance of
c.update(term.strip().lower() for term in row)
vs
c.update(map(str.lower, map(str.strip, row)))
vs
c.update(map(lambda term: term.strip().lower(), row))
with your file, because I'm not sure how well the double map performs in Python 2.x compared to a lambda or a generator expression.
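For example, a rough way to compare the three variants in isolation could be a timeit micro-benchmark like the sketch below (the 20-field sample row is made up for illustration):

import timeit

setup = ("from collections import Counter; "
         "c = Counter(); "
         "row = [' Keyword%d ' % i for i in range(20)]")

# Each statement processes the same made-up row 100,000 times.
print timeit.timeit("c.update(term.strip().lower() for term in row)",
                    setup=setup, number=100000)
print timeit.timeit("c.update(map(str.lower, map(str.strip, row)))",
                    setup=setup, number=100000)
print timeit.timeit("c.update(map(lambda term: term.strip().lower(), row))",
                    setup=setup, number=100000)

The counter keeps accumulating across repetitions, but since every variant sees the same row the comparison should still be fair.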
As an additional aside, it might make sense not to do the update on every iteration, but to buffer the terms for some number of rows. You can reuse the existing progress print as the point where the buffer is flushed into the counter. With this code:
from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()
buf = []

with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        buf.extend(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx
            c.update(buf)
            buf = []

c.update(buf)

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))
I get some 25% speed increase on my local test file (which has a lot of repeated lines). At this point it becomes a trade-off between the cost of update and extend. In the first version, update determines the runtime:
29877245 function calls (29877242 primitive calls) in 8.559 seconds
...
999181 2.756 0.000 7.515 0.000 collections.py:528(update)
For the second version the time is shared between update and extend, and the total is lower:
23882242 function calls (23882239 primitive calls) in 5.870 seconds
...
12 1.421 0.118 1.791 0.149 collections.py:528(update)
...
999180 0.779 0.000 3.065 0.000 {method 'extend' of 'list' objects}
I also tried basically your approach, updating the counter only at the end. This is about another 0.2 seconds faster at the cost of potentially unlimited memory usage.
- Brilliant - I had a feeling something like this would work but I could not conceptualize it. Let me test it out and report back. – jrjames83, Nov 29, 2016 at 15:19
- Graipher, what are you using to profile your code? – jrjames83, Nov 29, 2016 at 17:24
- @jrjames83 Just the built-in profiler: python -m cProfile path/to/script.py – Graipher, Nov 29, 2016 at 17:26
- I don't know how you're supposed to use this site :). In any event, I tried sending the data to the Counter class incrementally, and when running the profiler it doubled the time spent dealing with the class. It seems that as long as RAM is not an issue (the RAM utilization peaks at around 8 GB when I run it), it's faster to dump everything into the Counter from memory once the parsing is completed. – jrjames83, Nov 30, 2016 at 4:00