10
\$\begingroup\$

I would like to ask for code review for my Python histogram word counter.

# Given a body of text, return a hash table of the frequency of each word.
"""
# I use a hash map as my data structure to create the histogram
# and add words into the dictionary
# if words are not in the dictionary, then I add those word into the dictionary
# final output is that we return the dictionary
"""
# Word Counter
# Given an body of text, return a hash table of the frequency of each word.
# Parameters
# Input: text {String}
# Output: {Hash Table}
# Constraints
# Capital and lower case versions of the same word should be counted is the same word.
# Remove punctuations from all words.
# Time: O(N)
# Space: O(N)
# Where N is the number of characters in the string.
# Examples
# 'The cat and the hat.' --> '{ the: 2, cat: 1, and: 1, hat: 1 }'`
# 'As soon as possible.' --> '{ as: 2, soon: 1, possible: 1 }'`
# 'It's a man, it's a plane, it's superman!' --> '{ its: 3, a: 2, man: 1, plane: 1, superman: 1 }'`
def word_count(sentence):
 word_counter = {}
 wordlist = sentence.lower().split()
 for word in wordlist:
 word = re.sub('[.,:*! ]', '', word)
 if word in word_counter:
 word_counter[word] += 1
 else:
 word_counter[word] = 1
 return word_counter
example = word_count("It's a man, it's a plane, it's superman!")
Toby Speight
87.1k14 gold badges104 silver badges322 bronze badges
asked Jul 25, 2018 at 17:41
\$\endgroup\$

3 Answers 3

13
\$\begingroup\$

Several things can be improved in your existing code. First and foremost being, replacing your usage of re module. Regex matching is very heavy. You have a defined set of characters being replaced. Use str.replace or str.translate.

You have provided comments aptly throughout the program. However, you can place them as docstring for the function, including the examples as doctests. There are several syntaxes to format docstrings. I am mostly aware of epytext and reStructuredText. You can search around for other common formats :).

You'll end up with:

"""
Word Counter
Given an body of text, return a hash table of the frequency of each
word.
"""
def word_count(sentence):
 """
 Word Counter
 Given an body of text, return a hash table of the frequency of
 each word.
 .. warnings::
 - Capital and lower case versions of the same word should be counted
 as the same word.
 - Remove punctuations from all words.
 .. note::
 Where N is the number of characters in the string.
 - Time: O(N)
 - Space: O(N)
 :Example:
 >>> word_count('The cat and the hat.')
 {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
 >>> word_count('As soon as possible.')
 {'as': 2, 'soon': 1, 'possible': 1}
 >>> word_count("It's a man, it's a plane, it's superman!")
 {'its': 3, 'a': 2, 'man': 1, 'plane': 1, 'superman': 1}
 :param sentence: Input string
 :type sentence: str
 :return: Returns hash-table of frequence of each word in input
 :rtype: dict
 """
 translate = sentence.maketrans({char: None for char in "'.,:*!"})
 cleaned_words = sentence.lower().translate(translate).split()
 word_counter = {}
 for word in cleaned_words:
 if word in word_counter:
 word_counter[word] += 1
 else:
 word_counter[word] = 1
 return word_counter

You can also make use of collections.counter which has the same complexity as your current code. However, it also provides a few extra features on the resulting counter object, such as most_common.


The referenced links above are for python-2.7; however all the modules/packages are available in python 3.x

answered Jul 25, 2018 at 18:52
\$\endgroup\$
5
  • \$\begingroup\$ Just curious how common/pythonic it is to have docstrings longer than the function itself? \$\endgroup\$ Commented Jul 26, 2018 at 2:26
  • 2
    \$\begingroup\$ @BruceWayne For library functions it's quite common. Anything more top-level than that, not so much. But it doesn't hurt. Zen of Python: explicit is better than implicit. \$\endgroup\$ Commented Jul 26, 2018 at 5:34
  • 1
    \$\begingroup\$ @BruceWayne As an example: boltons.readthedocs.io/en/latest/_modules/boltons/… see various function that the module exposes. \$\endgroup\$ Commented Jul 26, 2018 at 7:35
  • \$\begingroup\$ I think that .translate(None, "'.,:*!") is more succinct and accomplishes the same thing as using a translation map. \$\endgroup\$ Commented Jul 26, 2018 at 21:22
  • \$\begingroup\$ @jon they changed the behaviour in python 3 for .translate \$\endgroup\$ Commented Jul 26, 2018 at 21:34
5
\$\begingroup\$

In addition to hjpotter92's answer you can improve your counting using dict.get()

for word in cleaned_words:
 word_counter[word] = word_counter.get(word, 0) + 1

dict.get(key, default)checks for the key in dict and returns default in case key is not in dict. Makes 1 line out of 4 and improves readability quite a bit. For sure using collections.counter is also a good approach - but it includes importing another package.

answered Jul 25, 2018 at 20:01
\$\endgroup\$
3
  • \$\begingroup\$ dict.get completely slipped my mind :) That is indeed cleaner than if-else block \$\endgroup\$ Commented Jul 25, 2018 at 21:09
  • 1
    \$\begingroup\$ Better yet, use defaultdict. \$\endgroup\$ Commented Jul 26, 2018 at 0:11
  • \$\begingroup\$ True, especially for larger lists to count it seems to outperform the dict.get() approach. Have to get used to not only using it for grouping :) \$\endgroup\$ Commented Jul 26, 2018 at 7:12
3
\$\begingroup\$

hjpotter92 has covered most things but an alternative to RandomDude's dict insertion improvement is to use a defaultdict

from collections import defaultdict
word_counter = defaultdict(int)
word_counter[word] += 1

An attempt to access a non-existent key will automatically initialise it to 0. Possibly a bit clearer than RandomDude's but it's a matter of preference.

answered Jul 26, 2018 at 9:52
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.