Improve efficiency for finding word frequency in text

Question 1

Premise: Write a program that counts the frequencies of each word in a text, and output each word with its count and line numbers where it appears. We define a word as a contiguous sequence of non-white-space characters. Different capitalizations of the same character sequence should be considered same word (e.g. Python and python). The output is formatted as follows: each line begins with a number indicating the frequency of the word, a white space, then the word itself, and a list of line numbers containing this word. You should output from the most frequent word to the least frequent. In case two words have the same frequency, the lexicographically smaller one comes first. All words are in lower case in the output.

Solution:

# Amazing Python libraries that prevent re-invention of wheels.
import re
import collections
import string
from operator import itemgetter
# Simulate work for terminal
print 'Scanning input for word frequency...'
# Text files
input = open('input.txt', 'r')
output = open('output.txt', 'w')
# Reads every word in input as lowercase while ignoring punctuation and
# returns a list containing 2-tuples of (word, frequency).
list = collections.Counter(input.read().lower()
 .translate(None, string.punctuation).split()).most_common()
# Sorts the list by frequency in descending order.
list = sorted(list, key=itemgetter(0))
# Sorts the list by lexicogrpahical order in descending order while maintaining
# previous sorting.
list = sorted(list, key=itemgetter(1), reverse=True)
# Gets the lines where each word in list appears in the input.
for word in list:
 lineList = []
 # Reads the input line by line. Stores each text line in 'line' and keeps a 
 # counter in 'lineNum' that starts at 1.
 for lineNum, line in enumerate(open('input.txt'),1):
 # If the word is in the current line then add it to the list of lines 
 # in which the current word appears in. The word frequency list is not 
 # mutated. This is stored in a separate list.
 if re.search(r'(^|\s)'+word[0]+'\s', 
 line.lower().translate(None, string.punctuation), 
 flags=re.IGNORECASE):
 lineList.append(lineNum)
 # Write output with proper format.
 output.write(str(word[1]) + ' ' + word[0] + ' ' + str(lineList)+'\n')
# End work simulation for terminal
print 'Done! Output in: output.txt'

How can I get the line numbers without re-reading the text? I want to achieve sub-O(n^2) time complexity. I am new to Python so feedback on other aspects is greatly appreciated as well.

Question 2

You're right, the basic point to make this program faster is to store the line numbers in the first pass.

The for word in list loop will run many times. The count depends on how many words you have. Every time it runs, it reopens and rereads the whole input file. This is awfully slow.

If you start using a more generic data structure than your Counter(), then you'll be able to store the line numbers. For example you could use a dictionary. The key of the dict is the word, the value is a list of line numbers. The hit count is stored indirectly as the length of the list of line numbers.

In one pass over the input you can populate this data structure. In a second step, you can properly sort this data structure and iterate over the elements.

An example implementation:

from collections import defaultdict
import sys 
if __name__ == "__main__":
 hit = defaultdict(list)
 for lineno, line in enumerate(sys.stdin, start=1):
 for word in line.split():
 hit[word.lower()].append(lineno)
 for word, places in sorted(hit.iteritems(), key=lambda (w, p): (-len(p), w)):
 print len(places), word, ",".join(map(str, sorted(set(places))))

A few more notes:

If you open() a file, close() it too. As soon as you're done with using it. (Or use the with statement to ensure that it gets closed.)
Don't hardcode input/output file names. Read from standard input (sys.stdin), write to standard output (sys.stdout or simply print), and let the user use his shell's redirections: program <input.txt >output.txt
Don't litter standard output with progress messages. If you absolutely need them use sys.stderr.write or get acquainted with the logging module.
Don't use regular expressions, unless it's necessary. Prefer string methods for their simplicity and speed.
Name your variables in lower_case_with_underscores as PEP8 recommends.

Question 3

Great review. Are there any performance reasons for note 2? Also, when you say "let the user use his shell's redirections: program <input.txt >output.txt" does that mean that when I run my program from the terminal I can pass the filenames as arguments?

Question 4

There are cases when (2) can help in better performance. In a pipeline like prg1 <input.txt | prg2 >output.txt prg1 and prg2 are started in parallel (at least in unix-like OSes). prg2 does not have to wait for prg1 to finish, but can start to work when prg1 produced any output.

Question 5

Can I pass the filenames as arguments? Yes, you can specify the filenames in the terminal. To be precise, these are not arguments, they are not processed by your program. They are called shell redirections and are processed by your command shell. They are extra useful, google them.

Question 6

@Gareth Rees: I'm not sure whether you have noticed, but with your edit you introduced a little semantic change. The edited code counts a word at most once per line. Previously (as in the OP) it counted multiple occurences in the same line, but later printed the line number only once.

Question 7

@rubasov: You're quite right. That was my mistake.

rubasov rubasov 2161 silver badge3 bronze badges · Accepted Answer · 2013-03-09 22:39:58Z

You're right, the basic point to make this program faster is to store the line numbers in the first pass.

The for word in list loop will run many times. The count depends on how many words you have. Every time it runs, it reopens and rereads the whole input file. This is awfully slow.

If you start using a more generic data structure than your Counter(), then you'll be able to store the line numbers. For example you could use a dictionary. The key of the dict is the word, the value is a list of line numbers. The hit count is stored indirectly as the length of the list of line numbers.

In one pass over the input you can populate this data structure. In a second step, you can properly sort this data structure and iterate over the elements.

An example implementation:

from collections import defaultdict
import sys 
if __name__ == "__main__":
 hit = defaultdict(list)
 for lineno, line in enumerate(sys.stdin, start=1):
 for word in line.split():
 hit[word.lower()].append(lineno)
 for word, places in sorted(hit.iteritems(), key=lambda (w, p): (-len(p), w)):
 print len(places), word, ",".join(map(str, sorted(set(places))))

A few more notes:

If you open() a file, close() it too. As soon as you're done with using it. (Or use the with statement to ensure that it gets closed.)
Don't hardcode input/output file names. Read from standard input (sys.stdin), write to standard output (sys.stdout or simply print), and let the user use his shell's redirections: program <input.txt >output.txt
Don't litter standard output with progress messages. If you absolutely need them use sys.stderr.write or get acquainted with the logging module.
Don't use regular expressions, unless it's necessary. Prefer string methods for their simplicity and speed.
Name your variables in lower_case_with_underscores as PEP8 recommends.

Great review. Are there any performance reasons for note 2? Also, when you say "let the user use his shell's redirections: program <input.txt >output.txt" does that mean that when I run my program from the terminal I can pass the filenames as arguments?
There are cases when (2) can help in better performance. In a pipeline like prg1 <input.txt | prg2 >output.txt prg1 and prg2 are started in parallel (at least in unix-like OSes). prg2 does not have to wait for prg1 to finish, but can start to work when prg1 produced any output.
Can I pass the filenames as arguments? Yes, you can specify the filenames in the terminal. To be precise, these are not arguments, they are not processed by your program. They are called shell redirections and are processed by your command shell. They are extra useful, google them.
@Gareth Rees: I'm not sure whether you have noticed, but with your edit you introduced a little semantic change. The edited code counts a word at most once per line. Previously (as in the OP) it counted multiple occurences in the same line, but later printed the line number only once.

Stack Exchange Network

Improve efficiency for finding word frequency in text

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Improve efficiency for finding word frequency in text

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions