Plotting a histogram from pre-counted data in Matplotlib
Question:
I’d like to use Matplotlib to plot a histogram over data that’s been pre-counted. For example, say I have the raw data
data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]
Given this data, I can use
pylab.hist(data, bins=[...])
to plot a histogram.
In my case, the data has been pre-counted and is represented as a dictionary:
counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
Ideally, I’d like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc., as if I had passed it the raw data. As a workaround, I’m expanding my counts into the raw data:
from itertools import chain, repeat
data = list(chain.from_iterable(repeat(value, count)
                                for (value, count) in counted_data.items()))
This is inefficient when counted_data contains counts for millions of data points.
Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?
Alternatively, if it’s easiest to just bar-plot data that’s been pre-binned, is there a convenience method to “roll-up” my per-item counts into binned counts?
Answers:
You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):
val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)
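To see why the weights approach is equivalent to expanding the counts, here is a minimal sketch that compares both with np.histogram. The bin edges are an arbitrary choice made for illustration, not something from the question.

```python
from itertools import chain, repeat

import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

# Expand to raw data (the workaround from the question), for comparison only.
raw = list(chain.from_iterable(repeat(v, c) for v, c in counted_data.items()))

values = list(counted_data.keys())
weights = list(counted_data.values())

# Hypothetical bin edges chosen for this example: 0, 3, 6, 9, 12.
edges = np.arange(0, 13, 3)

raw_counts, _ = np.histogram(raw, bins=edges)
weighted_counts, _ = np.histogram(values, bins=edges, weights=weights)

# Both routes yield the same per-bin totals.
print(raw_counts, weighted_counts)
```

The weighted call only ever touches one entry per distinct value, so it should scale with the number of keys rather than the number of original data points.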
Assuming you only have integers as the keys, you can also use bar directly:
min_bin = min(counted_data)  # iterating a dict yields its keys
max_bin = max(counted_data)
bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)
for k, v in counted_data.items():
    vals[k - min_bin] = v
plt.bar(bins, vals, ...)
where … is whatever arguments you want to pass to bar (doc).
If you want to re-bin your data, see “Histogram with separate list denoting frequency”.
I used pyplot.hist’s weights option to weight each key by its value, producing the histogram that I wanted:
pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))
This allows me to rely on hist to re-bin my data.
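If you would rather do the “roll-up” yourself and then bar-plot the result, a rough sketch could look like the following. It assumes np.histogram with weights for the binning and uses made-up bin edges; the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so this runs headless
import matplotlib.pyplot as plt
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

# Hypothetical bin edges of width 2, chosen for illustration.
edges = np.array([0, 2, 4, 6, 8, 10, 12])

# Roll the per-item counts up into per-bin counts.
binned, edges = np.histogram(list(counted_data.keys()),
                             bins=edges,
                             weights=list(counted_data.values()))

# Draw one bar per bin, starting at each left edge and spanning the bin width.
fig, ax = plt.subplots()
bars = ax.bar(edges[:-1], binned, width=np.diff(edges), align="edge")
```

Keeping the binning separate from the drawing means the binned array can be reused or saved without touching Matplotlib at all.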
The “bins” array holds the bin edges, so it should be one element longer than the “counts” array. Here’s the way to fully reconstruct the histogram:
import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
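The reconstruction above relies on equal-width bins, since it passes only a bin count and a range. As a variation (not the method of the answer above), the edges themselves can be passed as bins, with each bin represented by its left edge and weighted by its count. The edges and counts below are invented for the sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pre-binned data with unequal bin widths.
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
counts = np.array([5.0, 3.0, 4.0, 2.0])

# One point per bin (its left edge), weighted by that bin's count.
fig, ax = plt.subplots()
counts_, edges_, _ = ax.hist(edges[:-1], bins=edges, weights=counts)
```

Each left edge falls inside its own bin because bins are half-open on the right, so the returned counts_ should match counts. Newer Matplotlib versions also offer stairs for drawing pre-binned counts.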
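You can also use seaborn to plot the histogram (note that distplot has been deprecated since seaborn 0.11 in favor of histplot, which takes a weights argument directly):
import seaborn as sns
sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)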
Adding to tacaswell’s comment, plt.bar can be much more efficient than plt.hist here for large numbers of bins (>1e4). This is especially true for a crowded random plot, where you only need to plot the highest bars because the width required to see them will cover most of their neighbors anyway. You can pick out the highest bars and plot them with
i, = np.where(vals > min_height)  # indices of the bars above the chosen threshold
plt.bar(i, vals[i], width=len(bins)//50)
For data with other statistical trends, you may prefer to plot every 100th bar or something similar.
The trick here is that plt.hist wants to plot all of your bins whereas plt.bar will let you just plot the sparser set of visible bins.
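For completeness, here is a self-contained sketch of that thinning idea. The synthetic vals array and the 99th-percentile threshold are assumptions picked for illustration; in practice min_height depends on your data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pre-binned counts: 20000 bins with noisy heights.
rng = np.random.default_rng(0)
vals = rng.poisson(5, size=20000).astype(float)

# Assumed threshold: only bars above the 99th percentile are drawn.
min_height = np.percentile(vals, 99)
i, = np.where(vals > min_height)

# Wide bars so the few tall ones stay visible at this scale.
fig, ax = plt.subplots()
bars = ax.bar(i, vals[i], width=len(vals) // 50)
```

Only a small fraction of the 20000 bins ends up as drawn rectangles, which is where the speedup comes from.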
hist uses bar under the hood, so the following will produce something similar to what hist creates (assuming bins of equal size):
bins = [1,2,3]
heights = [10,20,30]
ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])