Can someone review my code which I put on exercism.io for the word-count exercise (found here)?
class Phrase:
def __init__(self, phrase):
self.phrase = phrase.strip().lower().split()
def word_count(self):
word_dict = {}
for word in self.phrase:
word_input = ''.join(e for e in word if e.isalnum()).strip()
if word_input:
word_dict[word_input] = word_dict.get(word_input, 0) + 1
return word_dict
-
\$\begingroup\$ I understand this is a Python exercise, but note that in real text word counting is not that easy. 'fish/meat' should be two words but 'A/C' only one. 'Los Angeles' should possibly be one word, while 'The Angels' are two words. It gets worse when you want to split sentences. nltk is a Python library that can help you with this if you're interested in natural language processing. \$\endgroup\$Quentin Pradet– Quentin Pradet2014年02月18日 09:57:38 +00:00Commented Feb 18, 2014 at 9:57
2 Answers 2
My first advice is stop writing classes if you don't need to : your class has only two methods an init
and a proper method. This could probably be written as a simple function.
def word_count(phrase):
word_dict = {}
for word in phrase.strip().lower().split():
word_input = ''.join(e for e in word if e.isalnum()).strip()
if word_input:
word_dict[word_input] = word_dict.get(word_input, 0) + 1
return word_dict
Also, a lot of string processing you are doing is not useful, complicated and/or potentially wrong :
- On
phrase
, as noticed by Gareth Rees, there's not point callingstrip()
since you callsplit()
.
As I wasn't aware that the result would be the same, here a proof of concept :
a=' this is a little test'
a.split() == a.strip().split()
-> True
and here's a link to the documentation for strip and split.
- On individual words : the way you get
word_input
fromword
looks a bit convoluted to me. Among other things, there's no point in callingstrip()
on a string that only contains alphanumerical characters. Also, just removing "special" characters does not sound like the best option :it's
andits
will be considered the same way, its annoying :-). Maybe some characters like'
should be taken into account during the splitting. Maybe, depending on the language you are to handle, other characters like-
should be kept (for instance in French, "basketball" is "basket-ball" but neither "basket" nor "ball" are French words so splitting would be quite awkward and so would be removing the dash).
Except for that, your code looks go to me!
However, if you wanted to make things even more pythonic, you could use defaultdict. This example will probably look familar to you.
-
2\$\begingroup\$ There's no point calling
.strip()
at all, since you call.split()
just afterwards. \$\endgroup\$Gareth Rees– Gareth Rees2014年02月18日 11:06:43 +00:00Commented Feb 18, 2014 at 11:06
Some notes:
- As already pointed out, an object-oriented approach seems overkill, a simple function will do.
self.phrase
. That's very confusing,phrase
is a string butself.phrase
is a list of strings. A better name would bewords
.
I'd write it with a functional approach using collections.Counter
:
from collections import Counter
def word_count(phrase):
words = phrase.lower().split()
return Counter("".join(c for c in word if c.isalnum()) for word in words)
Or even simpler:
from collections import Counter
import re
def word_count(phrase):
return Counter(re.sub(r"[^\w ]+", '', phrase.lower()).split())
-
\$\begingroup\$ This doesn't work on none ASCII characters such as:
досвидания!
\$\endgroup\$YoYoYo I'm Awesome– YoYoYo I'm Awesome2016年08月29日 19:25:40 +00:00Commented Aug 29, 2016 at 19:25