I wanted to create a histogram from a list of positive integers. I want to bin it so that it shows each individual number, say 1 through N, that has at least K elements in the data set, plus one bin with the number of elements greater than N.
'''
The goal is to make a histogram from integer data.
The last bin should collect every value above the last one with at least K elements.
x x                      x x
x x x                    x x x x
x x x   x x    ---->     x x x x
____________             __________
1 2 3 4 5 6              1 2 3 >3
'''
import matplotlib.pyplot as plt
import numpy as np
# Insert your favorite integer data here
data = [1, 1, 1, 2, 2, 2, 3, 3, 5, 6]
# Vanilla histogram for reference
hist, bins = np.histogram(data, bins=np.arange(1, 15))
center = (bins[:-1] + bins[1:]) / 2 - 0.5
f, ax = plt.subplots()
ax.bar(center, hist, align='center', edgecolor='k')
ax.set_xticks(center)
ax.set_title('vanilla hist')
plt.savefig('vanillahist')
plt.clf()
# Select the bin edge just after the last value with at least K elements
K = 2
maxnum = bins[1:-1][np.abs(np.diff(hist >= K)) > 0][-1]
# filter the bins from numpy to only contain this point and those prior
center = bins[bins <= maxnum]
# filter frequency data from numpy;
# bins/hist are ordered so that the first entries line up
newhist = hist[(bins[:-1] <= maxnum)]
newhist[-1] += np.sum(hist[(bins[:-1] > maxnum)])
# make the plot, hopefully as advertised!
f, ax = plt.subplots()
ax.bar(center, newhist, align='center', edgecolor='k')
ax.set_xticks(center)
ax.set_xticklabels(list(center[:-1].astype(int)) + ['> %i' % (maxnum - 1)])
plt.savefig('myhist')
plt.clf()
This involved a lot of trial and error, and I'm still not 100% sure this can handle all cases, though it's passed every test I've tried so far. Could I have made this code more readable? I feel particularly unsure about lines 28-38. My justification for the [:-1] line is that the first entry of bins corresponds to the first entry of hist.
1 Answer
Nice work! You might also be interested in the question "Matplotlib histogram with collection bin for high values".
I like the ASCII-art explanation :-)
Things I see that could improve the code:
- Put the histogram building in a function. This way others can import it / use it / copy-paste it more easily. Then it also becomes clearer what is the required input (data) and what are parameters that could be set by default (K, bins, title).
- The name center is misleading. It is a list. And in fact those are the bins. So I would call it bins, overwriting the old value.
- Instead of newhist you could call it bin_values or bin_heights.
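As a rough illustration of the first and third suggestions, here is a minimal sketch of what such a function might look like. The names `tail_binned_histogram` and `plot_tail_histogram` are hypothetical. The sketch assumes non-negative integer data (it relies on `np.bincount`), and it swaps the `np.diff` lookup from the question for a direct search for the largest value with at least `K` occurrences.

```python
import numpy as np


def tail_binned_histogram(data, K=2):
    """Return (labels, bin_heights) with the sparse tail merged into one bin.

    Hypothetical helper, not from the question. Assumes non-negative integers.
    """
    data = np.asarray(data, dtype=int)
    counts = np.bincount(data)              # counts[v] = occurrences of value v
    frequent = np.flatnonzero(counts >= K)  # values seen at least K times
    if frequent.size == 0:
        raise ValueError("no value occurs at least K times")
    first, last = data.min(), frequent[-1]  # last value that keeps its own bin

    labels = [str(v) for v in range(first, last + 1)]
    bin_heights = counts[first:last + 1].tolist()

    overflow = int(counts[last + 1:].sum())  # everything greater than `last`
    if overflow:
        labels.append('> %d' % last)
        bin_heights.append(overflow)
    return labels, bin_heights


def plot_tail_histogram(ax, labels, bin_heights, title=None):
    """Draw the bars on an existing matplotlib Axes passed in by the caller."""
    positions = np.arange(len(bin_heights))
    ax.bar(positions, bin_heights, align='center', edgecolor='k')
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    if title is not None:
        ax.set_title(title)
    return ax


labels, bin_heights = tail_binned_histogram([1, 1, 1, 2, 2, 2, 3, 3, 5, 6], K=2)
print(labels, bin_heights)  # ['1', '2', '3', '> 3'] [3, 3, 2, 2]
```

Taking the `ax` as an argument (for example one created with `plt.subplots()`) keeps the counting separate from the plotting, so the counting part can be checked without drawing anything.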
Also have a look at numpy.digitize. It returns the bin each event is in, including 0 for underflow and len(bins) for overflow. You then just need to call the normal histogram on this and fix the labels.
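A small sketch of how the `numpy.digitize` route could look, using the data from the question. The cutoff `last = 3` is an assumption set by hand here (it is the largest value that gets its own bin), and the variable names are only illustrative.

```python
import numpy as np

data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 5, 6])
last = 3                              # assumed cutoff, chosen by hand here
edges = np.arange(1, last + 2)        # [1, 2, 3, 4]

# Bin index per element: 0 means below edges[0] (underflow),
# i means edges[i-1] <= x < edges[i], len(edges) means x >= edges[-1] (overflow).
idx = np.digitize(data, edges)

# Ordinary histogram over the bin indices 0 .. len(edges)
counts, _ = np.histogram(idx, bins=np.arange(len(edges) + 2))

# Fix the labels: underflow, one label per single value, overflow
labels = (['< %d' % edges[0]]
          + [str(v) for v in edges[:-1]]
          + ['> %d' % last])

for label, count in zip(labels, counts):
    print(label, count)
```

With positive integer data starting at 1, the underflow bin stays empty and can simply be dropped before plotting.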