Histogram of a string

Question 1

I'm teaching myself Python and when a friend posted this sentence

Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !

I thought, as a fool, I would try to verify it by plotting a histogram. This is my code:

import matplotlib.pyplot as plt
import numpy as np
sentence = "Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !".lower()
# Convert the string to an array of integers
numbers = np.array([ord(c) for c in sentence])
u = np.unique(numbers)
# Make the integers range from 0 to n so there are no gaps in the histogram
# [0][0] was a hack to make sure `np.where` returned an int instead of an array.
ind = [np.where(u==n)[0][0] for n in numbers]
bins = range(0,len(u)+1)
hist, bins = np.histogram(ind, bins)
plt.bar(bins[:-1], hist, align='center')
plt.xticks(np.unique(ind), [str(unichr(n)) for n in set(numbers)])
plt.grid()
plt.show()

Which generates this plot (image not included):

Please let me know how to improve my code. Also, please let me know what I did wrong with plt.xticks that resulted in the gaps at the beginning and the end (or is that just a case of incorrect axis limits?).

Answer 1

Your code is pretty good! I have only one substantive suggestion and a few stylistic ones.

Style

Since sentence is a hard-coded constant, Python convention (PEP 8) is to name it in all uppercase, i.e. SENTENCE is a better name.
What are u and n in your code? It's hard to figure out what those variables mean. Could you be more descriptive with your naming?
Your call to .lower() on sentence is hidden after the very long sentence. For readability, I wouldn't hide a method call at the end of a very long string.
Python has multi-line string support using the """ delimiters. Using it makes the sentence and the code more readable, although at the expense of introducing newline \n characters that would show up on the histogram if they are not removed. In my code below I use the """ delimiter and remove the \n characters I introduced to break the string into screen-width-sized chunks. PEP 8 recommends limiting code lines to 79 characters.
You should consider breaking this code up into two functions, one to generate the data and one to make the graph, but we can leave that for another time.
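
For reference, the two-function split mentioned in the last point might look something like the sketch below. The names count_characters and plot_histogram are made up for illustration, and sorting the characters is my own choice rather than anything the original code did.

```python
from collections import Counter


def count_characters(text):
    """Return (characters, counts) for text, sorted by character."""
    # Lower-case so 'O' and 'o' are counted together, and drop line breaks
    histogram = Counter(text.lower().replace('\n', ''))
    characters = sorted(histogram)
    counts = [histogram[c] for c in characters]
    return characters, counts


def plot_histogram(characters, counts):
    """Draw one bar per character with matplotlib."""
    # Imported here so count_characters can be used without matplotlib
    import matplotlib.pyplot as plt

    positions = list(range(len(counts)))
    plt.bar(positions, counts, align='center')
    plt.xticks(positions, characters)
    plt.grid()
    plt.show()
```

Keeping the counting separate means you can check the numbers (or unit-test them) without opening a plot window; plot_histogram(*count_characters(SENTENCE)) would then draw the graph.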

Substance

Since your sentence is a Python string (not a NumPy character array), you can generate the data for your histogram quite easily by using the Counter data type from the collections module. It's designed for exactly this kind of application. Using it lets you entirely avoid the complications of bin edges vs. bin centers that come with np.histogram.
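
As a small illustration of Counter on its own (using just the last three words of the sentence, not the whole thing), something like this shows the parts of its behaviour that matter here:

```python
from collections import Counter

# Passing a string counts every character, including spaces and punctuation
counts = Counter("a single !")

print(counts['a'])            # how many times 'a' occurs
print(counts['z'])            # a missing key gives 0 instead of raising KeyError
print(counts.most_common(1))  # the most frequent character with its count
```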

Putting all these ideas together:

import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
SENTENCE = """Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, 
four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, 
twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, 
eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !"""
# generate histogram, dropping the newlines introduced by the triple-quoted string
letters_hist = Counter(SENTENCE.lower().replace('\n', ''))
# list() is needed on Python 3, where values() and keys() return views
counts = list(letters_hist.values())
letters = list(letters_hist.keys())
# graph data
bar_x_locations = np.arange(len(counts))
plt.bar(bar_x_locations, counts, align='center')
plt.xticks(bar_x_locations, letters)
plt.grid()
plt.show()

Other

It wasn't anything you did with plt.xticks that led to the gaps; that's the matplotlib default. If you want a "tight" border on the graph, try adding plt.xlim(-0.5, len(counts) - 0.5) before plt.show().

Comment 1

Excellent, thanks! I tried to use the multiline string, but didn't get as far as using replace.

Comment 2

S = ('a ' 'b ' 'c') is exactly the same as S = 'a b c'. But with the former syntax you can use implicit line continuation within parentheses and make a "single-line" string span multiple lines.
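
Spelled out over several lines, that looks roughly like this. OPENING is just an illustrative name, and it holds only the start of the sentence:

```python
# Adjacent string literals are joined into one string, so wrapping them in
# parentheses lets the pieces sit on separate lines without adding any '\n'.
S = ('a '
     'b '
     'c')

OPENING = ("Only the fool would take trouble to verify "
           "that his sentence was composed of ten a's")

print(S == 'a b c')
print('\n' in OPENING)
```

Note that each piece needs its own trailing space; nothing is inserted between the literals.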

Comment 3

Good tip, Mathias. Still, the best approach is probably to have the code read the sentence from a separate text file. Code and data rarely belong together.
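
A minimal sketch of that idea, assuming Python 3 and a UTF-8 text file; the function name and the file name in the usage note are hypothetical:

```python
from collections import Counter


def count_characters_in_file(path):
    """Count the characters in a text file, ignoring case and line breaks."""
    with open(path, encoding='utf-8') as handle:
        text = handle.read()
    # Same clean-up as for the triple-quoted string: lower-case, drop newlines
    return Counter(text.lower().replace('\n', ''))
```

With the sentence saved as, say, sentence.txt next to the script, count_characters_in_file('sentence.txt') would give the same kind of Counter as before, and the long string disappears from the code entirely.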

Answer 2

There are a few improvements that could be suggested, especially to vectorize things and use existing functions to do the bulk of the work. They are listed below:

You are already using np.unique on numbers to get u: u = np.unique(numbers). np.unique also has an optional argument, return_counts, that returns the count of each unique value. This handles the intended binning operation.
The rest of the work is all about creating the x-axis to cover all characters. For that, we can keep most of the existing code.

So, finally, we would have an implementation like this:

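import matplotlib.pyplot as plt
import numpy as np
# `sentence` is the lower-cased string defined in the question
# Get the IDs corresponding to each input character in input sentence
numbers = np.array([ord(c) for c in sentence])
# Count each distinct ID; np.unique returns the IDs sorted, aligned with the counts
ids, hist = np.unique(numbers, return_counts=True)
bins = np.arange(hist.size)
# Finally, plot the results
plt.bar(bins, hist, align='center')
# Label from the sorted IDs rather than set(numbers), whose order is not guaranteed;
# chr replaces Python 2's unichr and handles these ASCII characters on Python 2 and 3
plt.xticks(bins, [chr(n) for n in ids])
plt.grid()
plt.show()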

Accepted answer (the Counter-based one): Curt F., 2016-05-27 16:32:38Z

