I did it and it works as intended, but it seemed very inefficient. I split the string into a list of words, then standardized each word (removing accents, commas and periods). After that, I counted every word using a dictionary, then sorted it in a loop.
Is there a better way to do it?
Edit - Thanks to you both, with your tips I managed to improve the code a lot, it's much faster and more efficient now: https://pastebin.com/EN74daBG
import unidecode
from operator import itemgetter

def word_counter(text):
    counter = {}
    for word in text.lower().replace(",", "").replace(".", "").split():
        standardize = unidecode.unidecode(word)
        if standardize.isalnum():
            counter.setdefault(standardize, 0)
            counter[standardize] += 1
    for key, value in sorted(counter.items(), key=itemgetter(1), reverse=True):
        print("{} = {}".format(key, value))

word_counter('''text here''')
2 Answers
All in all this is not bad.
Split into functions
I would split this into several functions:
- one to generate a stream of words
- one to do the count
- one for the presentation
splitting the text
This is a simple generator:

def text_split(text):
    text = text.lower().replace(",", "").replace(".", "")
    for word in text.split():
        yield unidecode.unidecode(word)
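For a quick sanity check you can materialize the generator into a list (assuming the text_split above and the unidecode package; the sample sentence is arbitrary):

# quick check of text_split; any sentence works
print(list(text_split('Hello, wörld. Hello again.')))
# ['hello', 'world', 'hello', 'again']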
You can generalize this a bit using re and string.punctuation:
import re
import string

# character class matching any single ASCII punctuation character
PUNCTUATION = re.compile(rf'[{string.punctuation}]')

def text_split_re(text):
    text = PUNCTUATION.sub('', text.lower())
    for word in text.split():
        yield unidecode.unidecode(word)
This removes all punctuation in one go.
Counter
You use dict.setdefault, so you've read the documentation. If you had looked a tiny bit further, in the collections module, you'd have found Counter, which is exactly what you need, especially with its most_common method. It is so handy that you don't really need your own counting loop at all: you just feed the words to the Counter.
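As a quick illustration (the word list here is arbitrary):

from collections import Counter

# Counter tallies the words directly; most_common() returns (word, count)
# pairs sorted from most to least frequent
counts = Counter(['to', 'be', 'or', 'not', 'to', 'be'])
print(counts.most_common())  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]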
presentation
The presentation is then as simple as:

def print_result(word_count: Counter):
    for word, count in word_count.most_common():
        print(f'{word} = {count}')
putting it together
from collections import Counter

if __name__ == '__main__':
    words = text_split_re('''text here''')
    word_count = Counter(words)
    print_result(word_count)
tests
Splitting this into parts also allows you to unit-test each of them:
assert list(text_split_re('text here')) == ['text', 'here']
assert list(text_split_re('Text here')) == ['text', 'here']
assert list(text_split_re('Text he.re')) == ['text', 'here']
assert list(text_split_re('''Text
here''')) == ['text', 'here']
You've put your logic into a function, which is great for reusability and testing, but you call your function regardless of whether you run the script or import it: get yourself familiar with the if __name__ == '__main__': guard.
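A minimal sketch of the guard applied to your current call (the sample text stands in for whatever you pass):

def word_counter(text):
    ...  # counting logic as before

if __name__ == '__main__':
    # runs only when the file is executed directly, not when it is imported
    word_counter('''text here''')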
Now for the text processing part: you normalize your text using two different approaches:
- lower + replace on the whole text
- unidecode on single words
Instead, I would suggest doing the whole normalization on a per-word basis. The benefit is twofold:
- You avoid duplicating the entire text in memory three times in a row; only each word is copied.
- You can improve the function by accepting a stream of words instead of the entire text at once.
You can also improve this normalization process using str.translate to remove all the punctuation at once. Then you can map this function over all the words, filter them and count them more efficiently:
import sys
import unicodedata
from collections import Counter

import unidecode

# translation table mapping every Unicode punctuation code point to None
REMOVE_PUNCTUATION = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith('P')
)

def normalize(word):
    return unidecode.unidecode(word.translate(REMOVE_PUNCTUATION)).lower()

def word_counter(words_stream):
    return Counter(filter(str.isalnum, map(normalize, words_stream)))
Now you can call your function using whatever stream suits your needs:
if __name__ == '__main__':
    count = word_counter('''text here'''.split())
    print(count)
Or, more memory-friendly:
def read_file_word_by_word(filename):
    with open(filename) as f:
        for line in f:
            yield from line.split()

if __name__ == '__main__':
    count = word_counter(read_file_word_by_word('the_file_name.txt'))
    print(count)
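If you also want the sorted output from the first answer's presentation step, the returned Counter supports it directly (a sketch reusing the names defined above):

# print the words from most to least frequent, as in the original script
for word, count in word_counter(read_file_word_by_word('the_file_name.txt')).most_common():
    print(f'{word} = {count}')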