I just finished working on a coding challenge for learning Python 2.7. Essentially, I wrote a function that, when fed a string such as:
"The man drank the drink and ate the bread and forgot the drink"
I get in return:
{'and': {'ate': 1, 'forgot': 1},
'ate': {'the': 1},
'bread': {'and': 1},
'drank': {'the': 1},
'drink': {'and': 1},
'forgot': {'the': 1},
'man': {'drank': 1},
'the': {'bread': 1, 'drink': 2, 'man': 1}}
In other words, each word (that has a word following it) is a key, and the value is a dictionary of the words that come right after it, along with the number of times that happens. ('drink' follows 'the' twice in the string, hence the 2 value in its dictionary.)
Here's the function I wrote to accomplish this end:
import string

def word_counts(f):
    #Function to remove punctuation, change to lowercase, etc. from incoming string
    def string_clean(file_content):
        fc_new = "".join([i.lower() for i in file_content if i not in string.punctuation])
        fc_new = fc_new.split()
        return fc_new
    f = string_clean(f)
    unique_f = f[:]
    #For next part of function, get the unique words found in string.
    #We'll then run each through the string and find words that follow
    #Pop() the last word, since nothing follows it
    unique_f.pop()
    unique_f = list(set(unique_f))
    result = {}
    for word in unique_f:
        next_word_keeper = {}
        for _ in range(0, len(f)-1):
            if word == f[_]:
                if f[_+1] in next_word_keeper.keys():
                    next_word_keeper[f[_+1]] = next_word_keeper[f[_+1]] + 1
                else:
                    next_word_keeper[f[_+1]] = 1
        result[word] = next_word_keeper
    return result
Feedback appreciated, thanks.
1 Answer
- string.punctuation == string.punctuation.lower(), so lowercasing does not affect the punctuation check and you can lowercase the whole string in one call.
- You don't need string_clean to be a function, as you only use it once.
- Don't use _ as a variable name, and definitely not in a loop where you read it, as most people use it as a 'garbage' variable.
- You can use f[:-1] to get the same result as u = f[:]; u.pop().
- Your algorithm is OK, but it can be a bit odd to read.
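The first and fourth points can be checked directly. The snippet below is a small sketch, not part of the original code; the sample text and the names per_char, whole, same_cleaning and same_slice are made up for illustration.

```python
import string

# Punctuation characters have no uppercase forms, so lower() leaves them unchanged.
punctuation_is_lowercase = string.punctuation == string.punctuation.lower()

text = "The man, drank. The DRINK!"  # made-up sample input

# Lowercasing each character (as in the question) versus lowercasing once up front.
per_char = "".join([c.lower() for c in text if c not in string.punctuation])
whole = "".join([c for c in text.lower() if c not in string.punctuation])
same_cleaning = per_char == whole

# Slicing off the last element versus copying the list and popping it.
words = whole.split()
u = words[:]
u.pop()
same_slice = words[:-1] == u

print(punctuation_is_lowercase, same_cleaning, same_slice)
```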
To improve your code I'd use collections.defaultdict.
This will allow you to remove the innermost if/else, because if a key isn't in the dictionary yet, it is created with a default value for you.
>>> from collections import defaultdict
>>> next_word_keeper = defaultdict(int)
>>> next_word_keeper['test'] += 1
>>> next_word_keeper
defaultdict(<type 'int'>, {'test': 1})
>>> next_word_keeper['test'] += 1
>>> next_word_keeper
defaultdict(<type 'int'>, {'test': 2})
>>> next_word_keeper['test2'] += 1
>>> next_word_keeper
defaultdict(<type 'int'>, {'test': 2, 'test2': 1})
Using the above should get you:
from collections import defaultdict

def word_counts(f):
    f = f.lower().split()
    unique_f = list(set(f[:-1]))
    result = {}
    for word in unique_f:
        next_word_keeper = defaultdict(int)
        for i in range(len(f)-1):
            if word == f[i]:
                next_word_keeper[f[i + 1]] += 1
        result[word] = next_word_keeper
    return result
But this code is not the best when it comes to readability and performance!
Instead of going through the list once per unique word, you can go through it just once.
Using enumerate we get the current index, which we can then use to get the next word.
And then, using two defaultdicts, we can simplify the function to six lines:
def word_counts(line):
    line = line.lower().split()
    results = defaultdict(lambda:defaultdict(int))
    for i, value in enumerate(line[:-1]):
        results[value][line[i + 1]] += 1
    return results
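Note that the six-line version no longer strips punctuation the way string_clean did in the question. If that behaviour is still wanted, a minimal sketch that keeps it could look like the following. The name word_counts_clean and the punctuated sample sentence are hypothetical, and the final conversion to plain dicts is only there so the result prints like the example in the question.

```python
import string
from collections import defaultdict


def word_counts_clean(line):
    # Hypothetical variant: strip punctuation first, as string_clean did.
    cleaned = "".join(c for c in line.lower() if c not in string.punctuation)
    words = cleaned.split()
    results = defaultdict(lambda: defaultdict(int))
    for i, value in enumerate(words[:-1]):
        results[value][words[i + 1]] += 1
    # Convert the nested defaultdicts to plain dicts for tidier output.
    return dict((word, dict(followers)) for word, followers in results.items())


counts = word_counts_clean(
    "The man drank the drink, and ate the bread, and forgot the drink!"
)
print(counts)
```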
You can also go on to use the pairwise recipe from the itertools documentation to simplify the code further.
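As a rough sketch of that last suggestion, the version below defines pairwise following the recipe in the itertools documentation and uses it in place of the index arithmetic. It is written with the built-in zip so that it runs on Python 3; on Python 2.7 the documented recipe uses itertools.izip instead, and Python 3.10 and later ship itertools.pairwise directly.

```python
from collections import defaultdict
from itertools import tee


def pairwise(iterable):
    """Recipe from the itertools docs: s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


def word_counts(line):
    words = line.lower().split()
    results = defaultdict(lambda: defaultdict(int))
    # Each pair is (word, the word that directly follows it).
    for current, following in pairwise(words):
        results[current][following] += 1
    return results


counts = word_counts(
    "The man drank the drink and ate the bread and forgot the drink"
)
print(dict(counts["the"]))
```

Because pairwise yields nothing for inputs with fewer than two words, the function returns an empty mapping in that case without any special handling.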