I would like to ask for a code review of my Python histogram word counter.
# Given a body of text, return a hash table of the frequency of each word.
"""
# I use a hash map as my data structure to create the histogram
# and add words into the dictionary.
# If a word is not yet in the dictionary, I add it with a count of 1.
# The final output is the dictionary.
"""
# Word Counter
# Given a body of text, return a hash table of the frequency of each word.
# Parameters
# Input: text {String}
# Output: {Hash Table}
# Constraints
# Capital and lower-case versions of the same word should be counted as the same word.
# Remove punctuation from all words.
# Time: O(N)
# Space: O(N)
# Where N is the number of characters in the string.
# Examples
# 'The cat and the hat.' --> '{ the: 2, cat: 1, and: 1, hat: 1 }'
# 'As soon as possible.' --> '{ as: 2, soon: 1, possible: 1 }'
# "It's a man, it's a plane, it's superman!" --> '{ its: 3, a: 2, man: 1, plane: 1, superman: 1 }'
import re

def word_count(sentence):
    word_counter = {}
    wordlist = sentence.lower().split()
    for word in wordlist:
        # Strip the apostrophe too, so "it's" is counted as "its"
        # as required by the examples above.
        word = re.sub("['.,:*! ]", '', word)
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return word_counter

example = word_count("It's a man, it's a plane, it's superman!")
3 Answers
Several things can be improved in your existing code. First and foremost, replace your usage of the re module. Regex matching is comparatively heavy, and you are only removing a fixed set of characters. Use str.replace or str.translate instead.
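For example, a minimal sketch of the translation-table approach (the character set below just mirrors the punctuation in your regex):

# Build a table that deletes every character in the given set.
strip_punct = str.maketrans('', '', "'.,:*!")

sentence = "It's a man, it's a plane, it's superman!"
print(sentence.lower().translate(strip_punct).split())
# ['its', 'a', 'man', 'its', 'a', 'plane', 'its', 'superman']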
You have provided comments aptly throughout the program. However, you can place them in a docstring for the function, including the examples as doctests. There are several syntaxes for formatting docstrings; I am most familiar with epytext and reStructuredText, but you can search around for other common formats :).
You'll end up with:
"""
Word Counter
Given an body of text, return a hash table of the frequency of each
word.
"""
def word_count(sentence):
    """
    Word Counter

    Given a body of text, return a hash table of the frequency of
    each word.

    .. warning::

        - Capital and lower-case versions of the same word should be
          counted as the same word.
        - Remove punctuation from all words.

    .. note::

        Where N is the number of characters in the string.

        - Time: O(N)
        - Space: O(N)

    :Example:

    >>> word_count('The cat and the hat.')
    {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
    >>> word_count('As soon as possible.')
    {'as': 2, 'soon': 1, 'possible': 1}
    >>> word_count("It's a man, it's a plane, it's superman!")
    {'its': 3, 'a': 2, 'man': 1, 'plane': 1, 'superman': 1}

    :param sentence: Input string
    :type sentence: str
    :return: Hash table of the frequency of each word in the input
    :rtype: dict
    """
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    cleaned_words = sentence.lower().translate(translate).split()
    word_counter = {}
    for word in cleaned_words:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return word_counter
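Since the examples now double as doctests, you can verify them automatically. Assuming the code lives in a file named word_count.py (the filename here is just for illustration):

python -m doctest word_count.py -v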
You can also make use of collections.Counter, which has the same complexity as your current code. However, it also provides a few extra features on the resulting counter object, such as most_common().
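For instance, a minimal sketch of the same function built on Counter, reusing the translation-table cleanup from above:

from collections import Counter

def word_count(sentence):
    # Counter does the tallying; the cleanup is the same as before.
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    return Counter(sentence.lower().translate(translate).split())

counts = word_count("It's a man, it's a plane, it's superman!")
print(counts.most_common(1))  # [('its', 3)]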
The referenced links above are for Python 2.7; however, all the modules/packages are available in Python 3.x as well.
- BruceWayne (Jul 26, 2018 at 2:26): Just curious how common/pythonic it is to have docstrings longer than the function itself?
- (Jul 26, 2018 at 5:34): @BruceWayne For library functions it's quite common. Anything more top-level than that, not so much. But it doesn't hurt. Zen of Python: explicit is better than implicit.
- hjpotter92 (Jul 26, 2018 at 7:35): @BruceWayne As an example: boltons.readthedocs.io/en/latest/_modules/boltons/… see the various functions that the module exposes.
- Jon Claus (Jul 26, 2018 at 21:22): I think that .translate(None, "'.,:*!") is more succinct and accomplishes the same thing as using a translation map.
- hjpotter92 (Jul 26, 2018 at 21:34): @jon They changed the behaviour of .translate in Python 3.
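For anyone following along with that last exchange: in Python 2, str.translate accepted a deletion string as a second argument, but in Python 3 you build the table with str.maketrans instead. A quick sketch:

# Python 3 replacement for the old .translate(None, chars) idiom.
table = str.maketrans('', '', "'.,:*!")
print("it's a plane!".translate(table))  # its a plane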
In addition to hjpotter92's answer, you can improve your counting using dict.get():
for word in cleaned_words:
    word_counter[word] = word_counter.get(word, 0) + 1
dict.get(key, default) looks up key in the dict and returns default if the key is not present.
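A quick illustration of the semantics:

counts = {'the': 2}
print(counts.get('the', 0))  # 2
print(counts.get('cat', 0))  # 0 - a missing key falls back to the default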
That turns four lines into one and improves readability quite a bit. Using collections.Counter is certainly also a good approach, but it requires importing another module.
- hjpotter92 (Jul 25, 2018 at 21:09): dict.get completely slipped my mind :) That is indeed cleaner than the if-else block.
- jpmc26 (Jul 26, 2018 at 0:11): Better yet, use defaultdict.
- RandomDude (Jul 26, 2018 at 7:12): True, especially for larger lists to count it seems to outperform the dict.get() approach. Have to get used to not only using it for grouping :)
hjpotter92 has covered most things, but an alternative to RandomDude's dict-insertion improvement is to use a defaultdict:
from collections import defaultdict
word_counter = defaultdict(int)
word_counter[word] += 1
An attempt to access a non-existent key automatically initialises it to 0 (the value produced by calling int()). It is possibly a bit clearer than RandomDude's version, but it's a matter of preference.
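Put together with the punctuation cleanup from hjpotter92's answer, a sketch of the whole function might look like this:

from collections import defaultdict

def word_count(sentence):
    # Missing keys start at int() == 0, so no membership test is needed.
    word_counter = defaultdict(int)
    translate = sentence.maketrans({char: None for char in "'.,:*!"})
    for word in sentence.lower().translate(translate).split():
        word_counter[word] += 1
    return word_counter

print(word_count("It's a man, it's a plane, it's superman!"))
# defaultdict(<class 'int'>, {'its': 3, 'a': 2, 'man': 1, 'plane': 1, 'superman': 1})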