I would like to ask for a code review of my Python histogram word counter.
# Given a body of text, return a hash table of the frequency of each word.
"""
# I use a hash map as my data structure to create the histogram
# and add words into the dictionary.
# If a word is not yet in the dictionary, I add it with a count of 1.
# The final output is the dictionary.
"""
# Word Counter
# Given a body of text, return a hash table of the frequency of each word.
# Parameters
# Input: text {String}
# Output: {Hash Table}
# Constraints
# Capital and lower-case versions of the same word should be counted as the same word.
# Remove punctuation from all words.
# Time: O(N)
# Space: O(N)
# Where N is the number of characters in the string.
# Examples
# 'The cat and the hat.' --> '{ the: 2, cat: 1, and: 1, hat: 1 }'
# 'As soon as possible.' --> '{ as: 2, soon: 1, possible: 1 }'
# "It's a man, it's a plane, it's superman!" --> '{ its: 3, a: 2, man: 1, plane: 1, superman: 1 }'
import re

def word_count(sentence):
    word_counter = {}
    wordlist = sentence.lower().split()
    for word in wordlist:
        # Strip the apostrophe too, so "it's" is counted as "its"
        # as required by the examples above.
        word = re.sub("['.,:*! ]", '', word)
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return word_counter

example = word_count("It's a man, it's a plane, it's superman!")
3 Answers
Several things can be improved in your existing code. First and foremost, replace your usage of the re module. Regex matching is comparatively heavy, and you are only removing a fixed set of characters. Use str.replace or str.translate instead.
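For example, a minimal sketch of the translation-table approach (the character set below just mirrors the punctuation in your regex):

# Build a table that deletes every character in the given set.
strip_punct = str.maketrans('', '', "'.,:*!")

sentence = "It's a man, it's a plane, it's superman!"
print(sentence.lower().translate(strip_punct).split())
# ['its', 'a', 'man', 'its', 'a', 'plane', 'its', 'superman']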
You have provided comments aptly throughout the program. However, you can place them in a docstring for the function, including the examples as doctests. There are several syntaxes for formatting docstrings; I am most familiar with epytext and reStructuredText, but you can search around for other common formats :).
You'll end up with:
"""
Word Counter
Given an body of text, return a hash table of the frequency of each
word.
"""
def word_count(sentence):
    """
    Word Counter

    Given a body of text, return a hash table of the frequency of
    each word.

    .. warning::

        - Capital and lower-case versions of the same word should be
          counted as the same word.
        - Remove punctuation from all words.

    .. note::

        Where N is the number of characters in the string.

        - Time: O(N)
        - Space: O(N)

    :Example:

    >>> word_count('The cat and the hat.')
    {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
    >>> word_count('As soon as possible.')
    {'as': 2, 'soon': 1, 'possible': 1}
    >>> word_count("It's a man, it's a plane, it's superman!")
    {'its': 3, 'a': 2, 'man': 1, 'plane': 1, 'superman': 1}

    :param sentence: Input string
    :type sentence: str
    :return: Hash table of the frequency of each word in the input
    :rtype: dict
    """
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    cleaned_words = sentence.lower().translate(translate).split()
    word_counter = {}
    for word in cleaned_words:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return word_counter
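Since the examples now double as doctests, you can verify them automatically. Assuming the code lives in a file named word_count.py (the filename here is just for illustration):

python -m doctest word_count.py -v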
You can also make use of collections.Counter, which has the same complexity as your current code. However, it also provides a few extra features on the resulting counter object, such as most_common().
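For instance, a minimal sketch of the same function built on Counter, reusing the translation-table cleanup from above:

from collections import Counter

def word_count(sentence):
    # Counter does the tallying; the cleanup is the same as before.
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    return Counter(sentence.lower().translate(translate).split())

counts = word_count("It's a man, it's a plane, it's superman!")
print(counts.most_common(1))  # [('its', 3)]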
The referenced links above are for Python 2.7; however, all the modules/packages are available in Python 3.x as well.
- BruceWayne (Jul 26, 2018 at 2:26): Just curious how common/pythonic it is to have docstrings longer than the function itself?
- (Jul 26, 2018 at 5:34): @BruceWayne For library functions it's quite common. Anything more top-level than that, not so much. But it doesn't hurt. Zen of Python: explicit is better than implicit.
- hjpotter92 (Jul 26, 2018 at 7:35): @BruceWayne As an example: boltons.readthedocs.io/en/latest/_modules/boltons/… see the various functions that the module exposes.
- Jon Claus (Jul 26, 2018 at 21:22): I think that .translate(None, "'.,:*!") is more succinct and accomplishes the same thing as using a translation map.
- hjpotter92 (Jul 26, 2018 at 21:34): @jon They changed the behaviour of .translate in Python 3.
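For anyone following along with that last exchange: in Python 2, str.translate accepted a deletion string as a second argument, but in Python 3 you build the table with str.maketrans instead. A quick sketch:

# Python 3 replacement for the old .translate(None, chars) idiom.
table = str.maketrans('', '', "'.,:*!")
print("it's a plane!".translate(table))  # its a plane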
In addition to hjpotter92's answer, you can improve your counting using dict.get():
for word in cleaned_words:
    word_counter[word] = word_counter.get(word, 0) + 1
dict.get(key, default) looks up key in the dict and returns default if the key is not present.
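A quick illustration of the semantics:

counts = {'the': 2}
print(counts.get('the', 0))  # 2
print(counts.get('cat', 0))  # 0 - a missing key falls back to the default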
That turns four lines into one and improves readability quite a bit. Using collections.Counter is certainly also a good approach, but it requires importing another module.
- hjpotter92 (Jul 25, 2018 at 21:09): dict.get completely slipped my mind :) That is indeed cleaner than the if-else block.
- jpmc26 (Jul 26, 2018 at 0:11): Better yet, use defaultdict.
- RandomDude (Jul 26, 2018 at 7:12): True, especially for larger lists to count it seems to outperform the dict.get() approach. Have to get used to not only using it for grouping :)
hjpotter92 has covered most things, but an alternative to RandomDude's dict-insertion improvement is to use a defaultdict:
from collections import defaultdict
word_counter = defaultdict(int)
word_counter[word] += 1
An attempt to access a non-existent key automatically initialises it to 0 (the value produced by calling int()). It is possibly a bit clearer than RandomDude's version, but it's a matter of preference.
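Put together with the punctuation cleanup from hjpotter92's answer, a sketch of the whole function might look like this:

from collections import defaultdict

def word_count(sentence):
    # Missing keys start at int() == 0, so no membership test is needed.
    word_counter = defaultdict(int)
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    for word in sentence.lower().translate(translate).split():
        word_counter[word] += 1
    return word_counter

print(word_count("It's a man, it's a plane, it's superman!"))
# defaultdict(<class 'int'>, {'its': 3, 'a': 2, 'man': 1, 'plane': 1, 'superman': 1})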