I wrote a code that given a string (txt
), will print a histogram that contains the number of occurrences of each word in it and the word itself.
A word is defined to be a sequence of chars separated by one space or more.
The code is working, I just want to see if I can improve it.
def PrintWordsOccurence(txt):
lst = [x for x in txt.split(' ') if x != ' ' and x != '']
newlst = list(dict.fromkeys(lst))
for item in newlst:
print("[{0}] {1}".format(lst.count(item),item))
An input-output example:
>> PrintWordsOccurence("hello code review! this is my code")
[1] hello
[2] code
[1] review!
[1] this
[1] is
[1] my
The first line creates a list with all words, there will be duplicates in it, so to remove them, I turn the list into a dictionary and then I turn it back to a list. In the end, for each item in the list, I print the number of its occurrences in the original string and the item itself.
The time complexity of the code is \$\Theta(n^2)\$
Any suggestions to improve the code?
2 Answers 2
str.split method without arguments
It does the same filtering as you do. Just skip ' ' argument.
list(dict.fromkeys(list comprehension))
It looks overburdened, right? You can create a set of list comprehension as well:
newlst = list({x for x in txt.split()})
or just
newlst = list(set(txt.split()))
but the best way is
collections.Counter
You're reinventing it with worse time complexity.
Format strings
f-strings look better (but it depends on your task).
Function name
PEP8 advices to use lower case with underscores: print_words_occurence
All together
from collections import Counter
def print_words_occurence(txt):
for word, count in Counter(txt.split()).items():
print(f"[{count}] {word}")
Also consider dividing an algorithmic part and input/output - like yielding a pair (word, count) from the function and outputting it somewhere else.
This can be improved by implementing a two-line, basic counting sort in core python:
def print_words_occurence(txt):
wdict = dict()
for w in txt.strip().split():
wdict[w] = wdict.get(w, 0) + 1
# just to print results
[print(f"[{wdict[k]}] {k}") for k in wdict.keys()]
The run time is O(n) if the final dict isn't sorted. [If it is sorted, it's still O(n+k).] I don't think anything can be more efficient.