Word frequency generator in Python

Question 1

Please critique my word frequency generator. I am a beginner programmer so any criticism are welcome.

Original Code: http://pastebin.com/rSRfbnCt

Here is my revised code after the feedback:

import string, time, webbrowser, collections, urllib2
def timefunc(function):
 def wrapped(*args):
 start = time.time()
 data = function(*args)
 end = time.time()
 timetaken = end - start
 print "Function: "+function.__name__+"\nTime taken:",timetaken
 return data
 return wrapped
@timefunc
def process_text(text_file):
 words = text_file.read().lower().split()
 words = [word.strip(string.punctuation+string.whitespace) for word in words]
 words = [word for word in words if word]#skips ''(None) elements
 return words
@timefunc
def create_freq_dict(wordlist):
 freq_dict = collections.Counter(wordlist)
 return freq_dict
@timefunc
def create_sorted_list(freqdict):
 sorted_list = [(value,key) for key,value in list(freqdict.items())]#list() provides python 3 compatibility
 sorted_list.sort(reverse=True)
 return sorted_list
@timefunc
def write_results(sorted_list):
 text_file = open('wordfreq.txt','w')
 text_file.write('Word Frequency List\n\n')
 rank = 0
 for word in sorted_list:
 rank += 1
 write_str = "[{0}] {1:-<10}{2:->10}\n".format(rank, word[1],word[0])
 text_file.write(write_str)
 text_file.close()
## The Brothers Grimm
## This file can be obtained from Project Gutenberg:
## http://www.gutenberg.org/cache/epub/5314/pg5314.txt
web_file = urllib2.urlopen('http://www.gutenberg.org/cache/epub/5314/pg5314.txt')
wordlist = process_text(web_file)
freqdict = create_freq_dict(wordlist)
sorted_list = create_sorted_list(freqdict)
results = write_results(sorted_list)
webbrowser.open('wordfreq.txt')
print "END"

Question 2

import string, time, math, webbrowser
def timefunc(function, *args):
 start = time.time()
 data = function(*args)
 end = time.time()
 timetaken = end - start

I'd recommend calling this time_taken as its slightly easier to read.

 print "Function: "+function.__name__+"\nTime taken:",timetaken

Print already introduces newlines and combines different pieces. Take advantage of that.

 print "Function: ", function.__name__
 print "Time Taken: ", time_taken

That's easier to follow

 return data
def process_text(filename):

You never use filename in here, but you do use fin which is the same thing. typo?

 t = []

Not a very descriptive name. I suggest coming up with something clearer

 for line in fin:
 for i in line.split():

i usually means index which its not here

 word = i.lower()
 word = word.strip(string.punctuation)
 if word != '':
 t.append(word)
 return t

I'd do this as

 words = fin.read().lower().split()
 words = [word.strip() for word in words]
 words = [word for word in words if word]
 return words

I think its easier to follow and probably more efficient

def create_freq_dict(wordlist):
 d = dict()

d is not a very good name. Usually dicts are created with {} not dict(). No difference, but the first is generally preffered

 for word in wordlist:
 if word not in d.keys():
 d[word] = 1
 else:
 d[word] += 1

Use d = collections.defaultdict(int) or d = collections.Counter(). Both will make it easier to count up like this. See the python documentation for collections. You should actually be able to write this function in one line

 return d
def sort_dict(in_dict):
 t = []
 for key,value in in_dict.items():
 t.append((value, key))

in_dict.items() is a list already, there is no reason to copy the elements into the list. (NOTE: in Python 3.x in_dict.items() is no longer a list). Even if it wasn't a list you could do:

 t = list(in_dict.items())

Which would do the same thing your code does

 t.sort(reverse=True)
 return t

I'd implement this function as

return sorted(in_dict.items())

The sorted function takes anything sufficiently list-like and produces a sorted list from it.

def write_results(sorted_list):

sorted_list isn't a great name. It would be better to give an indication of what's in the list.

 fout = open('wordfreq.txt','w')
 fout.write('Word Frequency List\n\n')
 r = 0
 for i in sorted_list:
 r += 1

Use for r, (key, value) in enumerate(sortedlist): That way you don't need to manage r yourself, and you can refer the value as value rather then the harder to read i[1].

 fillamount = 20 - (len(i[1]) + len(str(r)))

Some of the those parens are unnecessary.

 write_str = str(r)+': '+i[1]+' '+('-' * (fillamount-2))+' '+str(i[0])+'\n'

Multiplication has precedence, you don't need the parens to make that happen. Also, python has a method ljust which does this for you

 write_str = (str(r) + ': ' + i[1]).ljust('-', 20) + str(i[0]) + '\n'

You may also want to consider using string formatting rather then adding strings

 write_str = ('%d: %s' % (r, i[1])).just('-', 20) + '%d\n' % i[0]

I think its easier to follow, although I'd probably split across several lines

 fout.write(write_str)
 fout.close()
## The Brothers Grimm
## This file can be obtained from Project Gutenberg:
## http://www.gutenberg.org/cache/epub/5314/pg5314.txt
fin = open('c:\Python27\My Programs\wx\grimm.txt')

fin presumable stands for file in. Give a name that indicates what's actually in it ike grim_text

wordlist = timefunc(process_text,fin)
freqdict = timefunc(create_freq_dict,wordlist)
sorted_list = timefunc(sort_dict,freqdict)
results = timefunc(write_results,sorted_list)

Very nice

webbrowser.open('wordfreq.txt')

That's a slightly unusual use of a webbrowser

print "END"

Question 3

Thanks for the great feedback. I have added my revised code.

Question 4

@talloaktrees, nice work. Two things: 1. the list in created_sorted_list isn't necessary for Python 3 compatibility. The python 3 iterator and python 2 list are similar enough that both work in that situation. In write_results, use for rank, word in enumerate(sorted_list): then you can avoid having to keep track of rank yourself.

Question 5

Yes I accidentally skipped over enumerate(sorted_list) when I was rewriting my code. Also, I forgot to mention that webbrowser.open('wordfreq.txt') was a quick way to open up a text file in the default text editor I found somewhere. I imagine it's a bit hackish.

Question 6

You could turn the timefunc function into a decorator, like this:

def timed(function):
 def wrapped(*args):
 start = time.time()
 data = function(*args)
 end = time.time()
 timetaken = end - start
 print "Function: "+function.__name__+"\nTime taken:",timetaken
 return data
 return wrapped

And then:

@timed
def process_text(filename):
 ...
@timed
def create_freq_dict(wordlist):
 ...

Etc., which would let you call your functions like this:

wordlist = process_text(fin)
freqdict = create_freq_dict(wordlist)
sorted_list = sort_dict(freqdict)
results = write_results(sorted_list)

Winston Ewert Winston Ewert 30.7k4 gold badges52 silver badges79 bronze badges · Accepted Answer · 2012-02-04 21:04:04Z

import string, time, math, webbrowser
def timefunc(function, *args):
 start = time.time()
 data = function(*args)
 end = time.time()
 timetaken = end - start

I'd recommend calling this time_taken as its slightly easier to read.

 print "Function: "+function.__name__+"\nTime taken:",timetaken

Print already introduces newlines and combines different pieces. Take advantage of that.

 print "Function: ", function.__name__
 print "Time Taken: ", time_taken

That's easier to follow

 return data
def process_text(filename):

You never use filename in here, but you do use fin which is the same thing. typo?

 t = []

Not a very descriptive name. I suggest coming up with something clearer

 for line in fin:
 for i in line.split():

i usually means index which its not here

 word = i.lower()
 word = word.strip(string.punctuation)
 if word != '':
 t.append(word)
 return t

I'd do this as

 words = fin.read().lower().split()
 words = [word.strip() for word in words]
 words = [word for word in words if word]
 return words

I think its easier to follow and probably more efficient

def create_freq_dict(wordlist):
 d = dict()

d is not a very good name. Usually dicts are created with {} not dict(). No difference, but the first is generally preffered

 for word in wordlist:
 if word not in d.keys():
 d[word] = 1
 else:
 d[word] += 1

Use d = collections.defaultdict(int) or d = collections.Counter(). Both will make it easier to count up like this. See the python documentation for collections. You should actually be able to write this function in one line

 return d
def sort_dict(in_dict):
 t = []
 for key,value in in_dict.items():
 t.append((value, key))

in_dict.items() is a list already, there is no reason to copy the elements into the list. (NOTE: in Python 3.x in_dict.items() is no longer a list). Even if it wasn't a list you could do:

 t = list(in_dict.items())

Which would do the same thing your code does

 t.sort(reverse=True)
 return t

I'd implement this function as

return sorted(in_dict.items())

The sorted function takes anything sufficiently list-like and produces a sorted list from it.

def write_results(sorted_list):

sorted_list isn't a great name. It would be better to give an indication of what's in the list.

 fout = open('wordfreq.txt','w')
 fout.write('Word Frequency List\n\n')
 r = 0
 for i in sorted_list:
 r += 1

Use for r, (key, value) in enumerate(sortedlist): That way you don't need to manage r yourself, and you can refer the value as value rather then the harder to read i[1].

 fillamount = 20 - (len(i[1]) + len(str(r)))

Some of the those parens are unnecessary.

 write_str = str(r)+': '+i[1]+' '+('-' * (fillamount-2))+' '+str(i[0])+'\n'

Multiplication has precedence, you don't need the parens to make that happen. Also, python has a method ljust which does this for you

 write_str = (str(r) + ': ' + i[1]).ljust('-', 20) + str(i[0]) + '\n'

You may also want to consider using string formatting rather then adding strings

 write_str = ('%d: %s' % (r, i[1])).just('-', 20) + '%d\n' % i[0]

I think its easier to follow, although I'd probably split across several lines

 fout.write(write_str)
 fout.close()
## The Brothers Grimm
## This file can be obtained from Project Gutenberg:
## http://www.gutenberg.org/cache/epub/5314/pg5314.txt
fin = open('c:\Python27\My Programs\wx\grimm.txt')

fin presumable stands for file in. Give a name that indicates what's actually in it ike grim_text

wordlist = timefunc(process_text,fin)
freqdict = timefunc(create_freq_dict,wordlist)
sorted_list = timefunc(sort_dict,freqdict)
results = timefunc(write_results,sorted_list)

Very nice

webbrowser.open('wordfreq.txt')

That's a slightly unusual use of a webbrowser

print "END"

Thanks for the great feedback. I have added my revised code.
@talloaktrees, nice work. Two things: 1. the list in created_sorted_list isn't necessary for Python 3 compatibility. The python 3 iterator and python 2 list are similar enough that both work in that situation. In write_results, use for rank, word in enumerate(sorted_list): then you can avoid having to keep track of rank yourself.
Yes I accidentally skipped over enumerate(sorted_list) when I was rewriting my code. Also, I forgot to mention that webbrowser.open('wordfreq.txt') was a quick way to open up a text file in the default text editor I found somewhere. I imagine it's a bit hackish.

Stack Exchange Network

Word frequency generator in Python

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Word frequency generator in Python

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions