I did it and it works as intended, but it seemed very inefficient. I split the string into a list of words, then standardized each word (removing accents, commas and periods). After that, I counted every word using a dictionary, then sorted it in a loop.
Is there a better way to do it?
Edit - Thanks to you both, with your tips I managed to improve the code a lot, it's much faster and more efficient now: https://pastebin.com/EN74daBG
import unidecode
from operator import itemgetter

def word_counter(text):
    counter = {}
    for word in text.lower().replace(",", "").replace(".", "").split():
        standardize = unidecode.unidecode(word)
        if standardize.isalnum():
            counter.setdefault(standardize, 0)
            counter[standardize] += 1
    for key, value in sorted(counter.items(), key=itemgetter(1), reverse=True):
        print("{} = {}".format(key, value))

word_counter('''text here''')
2 Answers
All in all this is not bad.
Split into functions
I would split this into several functions:
- one to generate a stream of words
- one to do the count
- one for the presentation
splitting the text
This is a simple generator:

def text_split(text):
    text = text.lower().replace(",", "").replace(".", "")
    for word in text.split():
        yield unidecode.unidecode(word)
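For a quick sanity check you can materialize the generator into a list (assuming the text_split above and the unidecode package; the sample sentence is arbitrary):

# quick check of text_split; any sentence works
print(list(text_split('Hello, wörld. Hello again.')))
# ['hello', 'world', 'hello', 'again']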
You can generalize this a bit using re and string.punctuation:
import re
import string

# character class matching any single ASCII punctuation character
PUNCTUATION = re.compile(rf'[{string.punctuation}]')

def text_split_re(text):
    text = PUNCTUATION.sub('', text.lower())
    for word in text.split():
        yield unidecode.unidecode(word)
This removes all punctuation in one go.
Counter
You use dict.setdefault, so you've read the documentation. If you had looked a tiny bit further, in the collections module, you'd have found Counter, which is exactly what you need, especially with its most_common method. It is so handy that you don't really need your own counting loop at all: you just feed the words to the Counter.
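As a quick illustration (the word list here is arbitrary):

from collections import Counter

# Counter tallies the words directly; most_common() returns (word, count)
# pairs sorted from most to least frequent
counts = Counter(['to', 'be', 'or', 'not', 'to', 'be'])
print(counts.most_common())  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]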
presentation
The presentation is then as simple as:

def print_result(word_count: Counter):
    for word, count in word_count.most_common():
        print(f'{word} = {count}')
putting it together
from collections import Counter

if __name__ == '__main__':
    words = text_split_re('''text here''')
    word_count = Counter(words)
    print_result(word_count)
tests
Splitting this into parts also allows you to unit-test each of them:
assert list(text_split_re('text here')) == ['text', 'here']
assert list(text_split_re('Text here')) == ['text', 'here']
assert list(text_split_re('Text he.re')) == ['text', 'here']
assert list(text_split_re('''Text
here''')) == ['text', 'here']
You've put your logic into a function, which is great for reusability and testing, but you call your function regardless of whether you run the script or import it: get yourself familiar with the if __name__ == '__main__': guard.
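A minimal sketch of the guard applied to your current call (the sample text stands in for whatever you pass):

def word_counter(text):
    ...  # counting logic as before

if __name__ == '__main__':
    # runs only when the file is executed directly, not when it is imported
    word_counter('''text here''')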
Now for the text processing part: you normalize your text using two different approaches:
- lower + replace on the whole text
- unidecode on single words
Instead, I would suggest doing the whole normalization on a per-word basis. The benefit is twofold:
- You avoid duplicating the entire text in memory three times in a row; only each word is copied.
- You can improve the function by accepting a stream of words instead of the entire text at once.
You can also improve this normalization process using str.translate to remove all the punctuation at once. Then you can map this function over all the words, filter them and count them more efficiently:
import sys
import unicodedata
from collections import Counter

import unidecode

# translation table mapping every Unicode punctuation code point to None
REMOVE_PUNCTUATION = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith('P')
)

def normalize(word):
    return unidecode.unidecode(word.translate(REMOVE_PUNCTUATION)).lower()

def word_counter(words_stream):
    return Counter(filter(str.isalnum, map(normalize, words_stream)))
Now you can call your function using whatever stream suits your needs:
if __name__ == '__main__':
    count = word_counter('''text here'''.split())
    print(count)
Or, more memory-friendly:
def read_file_word_by_word(filename):
    with open(filename) as f:
        for line in f:
            yield from line.split()

if __name__ == '__main__':
    count = word_counter(read_file_word_by_word('the_file_name.txt'))
    print(count)
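If you also want the sorted output from the first answer's presentation step, the returned Counter supports it directly (a sketch reusing the names defined above):

# print the words from most to least frequent, as in the original script
for word, count in word_counter(read_file_word_by_word('the_file_name.txt')).most_common():
    print(f'{word} = {count}')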