I'm teaching myself Python and when a friend posted this sentence
Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !
I thought, as a fool, I would try to verify it by plotting a histogram. This is my code:
import matplotlib.pyplot as plt
import numpy as np
sentence = "Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !".lower()
# Convert the string to an array of integers
numbers = np.array([ord(c) for c in sentence])
u = np.unique(numbers)
# Make the integers range from 0 to n so there are no gaps in the histogram
# [0][0] was a hack to make sure `np.where` returned an int instead of an array.
ind = [np.where(u==n)[0][0] for n in numbers]
bins = range(0,len(u)+1)
hist, bins = np.histogram(ind, bins)
plt.bar(bins[:-1], hist, align='center')
# Label each bar with its character; u is sorted, so it matches the bar order
plt.xticks(np.unique(ind), [chr(n) for n in u])
plt.grid()
plt.show()
Which generates
Please let me know how to improve my code. Also, please let me know what I did wrong with `plt.xticks` that resulted in the gaps at the beginning and the end (or is that just a case of incorrect axis limits?).
2 Answers
Your code is pretty good! I have only one substantive suggestion and a few stylistic ones.
Style
- Since `sentence` is a hard-coded variable, Python convention is that it should be in all-uppercase, i.e. `SENTENCE` is a better variable name.
- What are `u` and `n` in your code? It's hard to figure out what those variables mean. Could you be more descriptive with your naming?
- Your call to `.lower()` on `sentence` is hidden after the very long sentence. For readability I wouldn't hide any function calls at the end of very long strings.
- Python has multi-line string support using the `"""` delimiters. Using it makes the sentence and the code more readable, although at the expense of introducing newline `\n` characters that would show up on the histogram if they are not removed. In my code below I use the `"""` delimiter and remove the `\n` characters I introduced to break the string into screen-width-sized chunks. PEP8 convention is that code lines shouldn't be more than about 80 characters long.
- You should consider breaking this code up into two functions, one to generate the data and one to make the graph, but we can leave that for another time.
Substance
- Since your sentence is a Python string (not a NumPy character array), you can generate the data for your histogram quite easily by using the `Counter` data type that is available in the `collections` module. It's designed for exactly applications like this. Doing so will let you entirely avoid the complications of bin edges vs. bin centers that stem from using `np.histogram`.
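As a quick illustration of what `Counter` does, here is a minimal sketch on a short string (the word "mississippi" is just an illustrative input, not part of the original sentence). It maps each character to the number of times it occurs, and a character that never occurs simply counts as zero:

```python
from collections import Counter

# Count every character in a short example string
letter_counts = Counter("mississippi")

print(letter_counts['s'])  # number of times 's' occurs
print(letter_counts['z'])  # missing characters count as 0 instead of raising KeyError
```

Because the counts are keyed by the character itself, there are no bin edges to manage.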
Putting all these ideas together:
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
SENTENCE = """Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's,
four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's,
twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's,
eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !"""
# generate histogram
letters_hist = Counter(SENTENCE.lower().replace('\n', ''))
# Convert the dict views to lists so that matplotlib gets plain sequences
counts = list(letters_hist.values())
letters = list(letters_hist.keys())
# graph data
bar_x_locations = np.arange(len(counts))
plt.bar(bar_x_locations, counts, align='center')
plt.xticks(bar_x_locations, letters)
plt.grid()
plt.show()
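The style suggestion about splitting the code into two functions could look roughly like the sketch below. The names `count_characters` and `plot_counts` are my own placeholders, and sorting the labels is an extra choice of mine rather than something the code above does:

```python
from collections import Counter


def count_characters(text):
    """Return a Counter mapping each lower-cased character of `text` to its count."""
    return Counter(text.lower())


def plot_counts(counts):
    """Draw a bar chart from a character -> count mapping."""
    # Imported here so the counting function can be used without matplotlib
    import matplotlib.pyplot as plt

    labels = sorted(counts)
    heights = [counts[label] for label in labels]
    positions = list(range(len(labels)))
    plt.bar(positions, heights, align='center')
    plt.xticks(positions, labels)
    plt.grid()
    plt.show()


# The data can now be checked without drawing anything
counts = count_characters("Hello, fool!")
print(counts['l'], counts['o'])
```

Calling `plot_counts(counts)` afterwards draws the graph. Keeping the two steps apart makes it possible to test the counting on its own.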
Other
It wasn't anything you did with `plt.xticks` that led to the gaps. That's the matplotlib default. If you want a "tight" border to the graph, try adding a `plt.xlim(-0.5, len(counts) - 0.5)` before the `plt.show()`.
- Excellent, thanks! I tried to use the multiline string, but didn't get as far as using replace. – Dan, May 27, 2016 at 17:15
- `S = ('a ' 'b ' 'c')` is exactly the same as `S = 'a b c'`. But using the former syntax you can use implicit line continuation within parentheses and make a "single-line" string span multiple lines. – 301_Moved_Permanently, May 28, 2016 at 20:16
- Good tip Mathias. Still, the best approach is probably writing the code to read the `sentence` from a separate text file. Code and data rarely belong together. – Curt F., May 28, 2016 at 20:48
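A small sketch of the implicit-concatenation tip from the comments, using only the opening of the sentence as an excerpt (the name `EXCERPT` is a placeholder of mine). Adjacent string literals inside parentheses are joined at compile time, so no `\n` characters are introduced and nothing needs to be replaced afterwards:

```python
# Adjacent string literals are concatenated into a single string
joined = ('a ' 'b ' 'c')
print(joined == 'a b c')

# The same mechanism lets a long string span several short source lines
EXCERPT = (
    "Only the fool would take trouble to verify "
    "that his sentence was composed of ten a's"
)
print('\n' in EXCERPT)
```

Note that each piece has to carry its own trailing space, otherwise the words on either side of a line break run together.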
There are a few improvements that could be suggested, especially to vectorize things and to use existing functions to do the bulk of the operations. They are listed below:
You are already using `np.unique` on `numbers` to get `u`: `u = np.unique(numbers)`. Now, `np.unique` also has an optional argument, `return_counts`, that makes it return the counts as well. This should handle the intended binning operation.
The rest of the work is all about creating the x-axis to cover all characters. For that, we can keep most of the existing code.
So, finally we would have an implementation like so -
# Get the IDs corresponding to each input character in input sentence
numbers = np.array([ord(c) for c in sentence])
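# Count each distinct character ID; np.unique returns the IDs in sorted order
unique_numbers, hist = np.unique(numbers, return_counts=True)
# Set up x-axis positions for plotting
bins = np.arange(0, hist.size)
# Finally, plot the results
plt.bar(bins, hist, align='center')
# Label the bars from the sorted IDs so that the labels line up with the counts
plt.xticks(bins, [chr(n) for n in unique_numbers])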
plt.grid()
plt.show()
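To see what `return_counts=True` gives back, here is a minimal check on a short word (the input "banana" and the variable names are illustrative assumptions, not part of the code above). `np.unique` returns the sorted distinct values together with how often each one occurs:

```python
import numpy as np

# Character codes of a short example word
codes = np.array([ord(c) for c in "banana"])

# Sorted distinct codes and the number of occurrences of each
unique_codes, counts = np.unique(codes, return_counts=True)

labels = [chr(int(code)) for code in unique_codes]
print(dict(zip(labels, counts.tolist())))  # {'a': 3, 'b': 1, 'n': 2}
```

Since both arrays share the same order, the labels and the bar heights stay aligned without any extra lookup.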