Plotting a histogram from pre-counted data in Matplotlib

Question:

I’d like to use Matplotlib to plot a histogram over data that’s been pre-counted. For example, say I have the raw data

data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]

Given this data, I can use

pylab.hist(data, bins=[...])

to plot a histogram.

In my case, the data has been pre-counted and is represented as a dictionary:

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

Ideally, I’d like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc., as if I had passed it the raw data. As a workaround, I’m expanding my counts into the raw data:

from itertools import chain, repeat

data = list(chain.from_iterable(repeat(value, count)
            for (value, count) in counted_data.items()))

This is inefficient when counted_data contains counts for millions of data points.

Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?

Alternatively, if it’s easiest to just bar-plot data that’s been pre-binned, is there a convenience method to “roll-up” my per-item counts into binned counts?

Asked By: Josh Rosen


Answers:

You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):

val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)

Assuming you only have integers as the keys, you can also use bar directly:

min_bin = min(counted_data)  # smallest key
max_bin = max(counted_data)  # largest key

bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)

for k,v in counted_data.items():
    vals[k - min_bin] = v

plt.bar(bins, vals, ...)

where … stands for whatever arguments you want to pass to bar (see the bar documentation).

If you want to re-bin your data, see the question “Histogram with separate list denoting frequency”.
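As a rough sketch of the re-binning step, np.histogram can roll the per-item counts up into coarser bins when the counts are passed as weights. The bin edges below are hypothetical and chosen only for illustration; counted_data is the dictionary from the question.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

values = np.array(list(counted_data.keys()))
weights = np.array(list(counted_data.values()))

# Hypothetical bin edges: [0, 4), [4, 8), [8, 12]
edges = np.array([0, 4, 8, 12])

# Each key contributes its count (not 1) to the bin it falls into.
binned_counts, _ = np.histogram(values, bins=edges, weights=weights)
print(binned_counts)
```

The resulting array can then be drawn with something like plt.bar(edges[:-1], binned_counts, width=np.diff(edges), align="edge").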

Answered By: tacaswell

I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:

pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))

This allows me to rely on hist to re-bin my data.
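A quick way to convince yourself that the weighted call matches the expanded raw data is to compare the two with np.histogram directly. This is only a sanity-check sketch; the bin edges are an assumption made for the example.

```python
from itertools import chain, repeat

import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
edges = np.arange(0, 13, 2)  # assumed bin edges: 0, 2, 4, ..., 12

# Workaround from the question: expand the counts back into raw data.
expanded = list(chain.from_iterable(
    repeat(value, count) for value, count in counted_data.items()))
raw_counts, _ = np.histogram(expanded, bins=edges)

# Weighted version: one entry per distinct value, weighted by its count.
weighted_counts, _ = np.histogram(
    list(counted_data.keys()), bins=edges,
    weights=list(counted_data.values()))

assert np.array_equal(raw_counts, weighted_counts)
```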

Answered By: Josh Rosen

The “bins” array should be one element longer than the “counts” array, because it holds the bin edges. Here’s a way to fully reconstruct the histogram:

import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
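
If the bin edges and counts are already known, another option (assuming Matplotlib 3.4 or newer, which is not something the answer above relies on) is Axes.stairs, which draws pre-binned counts without going through hist at all. A minimal sketch using the same arrays:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

edges = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7], dtype=float)

fig, ax = plt.subplots()
# stairs expects len(edges) == len(counts) + 1
step_patch = ax.stairs(counts, edges, fill=True)
```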

Answered By: R. Yang

You can also use seaborn to plot the histogram:

import seaborn as sns

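# Note: distplot is deprecated as of seaborn 0.11. In newer versions,
# sns.histplot(x=list(counted_data.keys()),
#              weights=list(counted_data.values())) takes weights directly.
sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)

Answered By: youssef mhiri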

Adding to tacaswell’s comment, plt.bar can be much more efficient than plt.hist here when there are many bins (more than about 1e4). This is especially true for a crowded, noisy plot where you only need to plot the highest bars, because the width required to make them visible will cover most of their neighbors anyway. You can pick out the highest bars and plot them with

i, = np.where(vals > min_height)
plt.bar(i,vals[i],width=len(bins)//50)

For data with other kinds of trends, it may make more sense to plot every 100th bar or something similar.

The trick here is that plt.hist wants to plot all of your bins whereas plt.bar will let you just plot the sparser set of visible bins.
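
Here is a small sketch of both selection strategies. The synthetic vals array and the 99th-percentile threshold are assumptions made up for this example, not part of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.integers(0, 1000, size=20_000)  # synthetic counts, one per bin

# Strategy 1: keep only the bars taller than a threshold.
min_height = np.percentile(vals, 99)  # hypothetical cutoff
(i,) = np.where(vals > min_height)

# Strategy 2: keep every 100th bar.
every_100th = np.arange(0, len(vals), 100)

print(len(vals), len(i), len(every_100th))
```

Either index array can then be passed to plt.bar, for example plt.bar(i, vals[i], width=len(vals) // 50).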

Answered By: Max

hist uses bar under the hood, so the following will produce something similar to what hist creates (it assumes bins of equal size):

bins = [1,2,3]
heights = [10,20,30]

ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])
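
If the bins are not all the same size, one possible variation is to pass the bin edges and let np.diff supply a width per bar, with align='edge' so each bar starts at its left edge. The edges and heights below are hypothetical values for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])  # hypothetical unequal bin edges
heights = np.array([10, 20, 30, 5])          # one height per bin

fig, ax = plt.subplots()
# Left edge of each bin as x, and the bin's own width as the bar width.
bars = ax.bar(edges[:-1], heights, width=np.diff(edges), align="edge")
```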

Answered By: Edu