Premise: Write a program that counts the frequencies of each word in a text, and output each word with its count and line numbers where it appears. We define a word as a contiguous sequence of non-white-space characters. Different capitalizations of the same character sequence should be considered same word (e.g. Python and python). The output is formatted as follows: each line begins with a number indicating the frequency of the word, a white space, then the word itself, and a list of line numbers containing this word. You should output from the most frequent word to the least frequent. In case two words have the same frequency, the lexicographically smaller one comes first. All words are in lower case in the output.
Solution:
# Amazing Python libraries that prevent re-invention of wheels.
import re
import collections
import string
from operator import itemgetter
# Simulate work for terminal
print 'Scanning input for word frequency...'
# Text files
input = open('input.txt', 'r')
output = open('output.txt', 'w')
# Reads every word in input as lowercase while ignoring punctuation and
# returns a list containing 2-tuples of (word, frequency).
list = collections.Counter(input.read().lower()
.translate(None, string.punctuation).split()).most_common()
# Sorts the list by frequency in descending order.
list = sorted(list, key=itemgetter(0))
# Sorts the list by lexicogrpahical order in descending order while maintaining
# previous sorting.
list = sorted(list, key=itemgetter(1), reverse=True)
# Gets the lines where each word in list appears in the input.
for word in list:
lineList = []
# Reads the input line by line. Stores each text line in 'line' and keeps a
# counter in 'lineNum' that starts at 1.
for lineNum, line in enumerate(open('input.txt'),1):
# If the word is in the current line then add it to the list of lines
# in which the current word appears in. The word frequency list is not
# mutated. This is stored in a separate list.
if re.search(r'(^|\s)'+word[0]+'\s',
line.lower().translate(None, string.punctuation),
flags=re.IGNORECASE):
lineList.append(lineNum)
# Write output with proper format.
output.write(str(word[1]) + ' ' + word[0] + ' ' + str(lineList)+'\n')
# End work simulation for terminal
print 'Done! Output in: output.txt'
How can I get the line numbers without re-reading the text? I want to achieve sub-O(n^2) time complexity. I am new to Python so feedback on other aspects is greatly appreciated as well.
1 Answer 1
You're right, the basic point to make this program faster is to store the line numbers in the first pass.
The for word in list
loop will run many times. The count depends on how many words you have. Every time it runs, it reopens and rereads the whole input file. This is awfully slow.
If you start using a more generic data structure than your Counter()
, then you'll be able to store the line numbers. For example you could use a dictionary. The key of the dict is the word, the value is a list of line numbers. The hit count is stored indirectly as the length of the list of line numbers.
In one pass over the input you can populate this data structure. In a second step, you can properly sort this data structure and iterate over the elements.
An example implementation:
from collections import defaultdict
import sys
if __name__ == "__main__":
hit = defaultdict(list)
for lineno, line in enumerate(sys.stdin, start=1):
for word in line.split():
hit[word.lower()].append(lineno)
for word, places in sorted(hit.iteritems(), key=lambda (w, p): (-len(p), w)):
print len(places), word, ",".join(map(str, sorted(set(places))))
A few more notes:
If you
open()
a file,close()
it too. As soon as you're done with using it. (Or use thewith
statement to ensure that it gets closed.)Don't hardcode input/output file names. Read from standard input (
sys.stdin
), write to standard output (sys.stdout
or simplyprint
), and let the user use his shell's redirections:program <input.txt >output.txt
Don't litter standard output with progress messages. If you absolutely need them use
sys.stderr.write
or get acquainted with thelogging
module.Don't use regular expressions, unless it's necessary. Prefer string methods for their simplicity and speed.
Name your variables in
lower_case_with_underscores
as PEP8 recommends.
-
\$\begingroup\$ Great review. Are there any performance reasons for note 2? Also, when you say "let the user use his shell's redirections:
program <input.txt >output.txt
" does that mean that when I run my program from the terminal I can pass the filenames as arguments? \$\endgroup\$user22969– user229692013年03月10日 05:38:24 +00:00Commented Mar 10, 2013 at 5:38 -
1\$\begingroup\$ There are cases when (2) can help in better performance. In a pipeline like
prg1 <input.txt | prg2 >output.txt
prg1 and prg2 are started in parallel (at least in unix-like OSes). prg2 does not have to wait for prg1 to finish, but can start to work when prg1 produced any output. \$\endgroup\$rubasov– rubasov2013年03月10日 09:21:06 +00:00Commented Mar 10, 2013 at 9:21 -
1\$\begingroup\$ Can I pass the filenames as arguments? Yes, you can specify the filenames in the terminal. To be precise, these are not arguments, they are not processed by your program. They are called
shell redirections
and are processed by your command shell. They are extra useful, google them. \$\endgroup\$rubasov– rubasov2013年03月10日 09:28:59 +00:00Commented Mar 10, 2013 at 9:28 -
\$\begingroup\$ @Gareth Rees: I'm not sure whether you have noticed, but with your edit you introduced a little semantic change. The edited code counts a word at most once per line. Previously (as in the OP) it counted multiple occurences in the same line, but later printed the line number only once. \$\endgroup\$rubasov– rubasov2013年03月11日 17:18:21 +00:00Commented Mar 11, 2013 at 17:18
-
\$\begingroup\$ @rubasov: You're quite right. That was my mistake. \$\endgroup\$Gareth Rees– Gareth Rees2013年03月11日 17:29:44 +00:00Commented Mar 11, 2013 at 17:29