Can't get y-axis on Matplotlib histogram to display probabilities

Question:

I have data (pd Series) that looks like (daily stock returns, n = 555):

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I’d like to generate a probability distribution plot from this. Using:

print stats.normaltest(S)

n, bins, patches = plt.hist(S, 100, normed=1, facecolor='blue', alpha=0.75)
print np.sum(n * np.diff(bins))

(mu, sigma) = stats.norm.fit(S)
print mu, sigma
y = mlab.normpdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05,0.05)
plt.show()

I get the following:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

graph

I have the impression the y-axis is a count, but I’d like to have probabilities instead. How do I do that? I’ve tried a whole lot of StackOverflow answers and can’t figure this out.

Asked By: Joël

||

Answers:

There is no easy way (that I know of) to do that using plt.hist. But you can simply bin the data using np.histogram and then normalize the result any way you want. If I understood you correctly, you want each bar to show the probability of finding a point in that bin, NOT the probability density. That means you have to scale your data so that the sum over all bins is 1, which can be done with bin_probability = n/float(n.sum()).

You will then no longer have a properly normalized probability density function (pdf), meaning that the integral over an interval will not be a probability! That is why you have to rescale the fitted normal pdf (mlab.normpdf in your code) to have the same norm as your histogram. Starting from the properly normalized binned pdf, the sum over all bins of height times width is 1; now you want the sum of the heights alone to be 1. For equal-width bins, each height is therefore multiplied by the bin width, so the scaling factor for the curve is also the bin width.
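The scaling argument can be checked numerically. The following is a minimal sketch, assuming equal-width bins and synthetic normal data; the seed and the variable names are illustrative and not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed, purely illustrative
sample = rng.normal(0.0, 0.01, size=1000)   # synthetic stand-in for the returns

counts, edges = np.histogram(sample, bins=100)
width = edges[1] - edges[0]                 # all bins share this width

density = counts / (counts.sum() * width)   # bar heights with total area 1
probability = counts / counts.sum()         # bar heights that sum to 1

area = np.sum(density * width)              # area under the density bars
total = probability.sum()                   # sum of the probability bars
same = np.allclose(probability, density * width)
print(area, total, same)
```

Multiplying the density heights by the bin width gives the per-bin probabilities, which is why the same factor is applied to the fitted curve.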

Therefore, the code you end up with is something along the lines of:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Produce test data
S = np.random.normal(0, 0.01, size=1000)

# Histogram:
# Bin it
n, bin_edges = np.histogram(S, 100)
# Normalize it, so that every bin's value gives the probability of that bin
bin_probability = n / float(n.sum())
# Get the midpoints of every bin
bin_middles = (bin_edges[1:] + bin_edges[:-1]) / 2.
# Compute the bin width (all bins have the same width here)
bin_width = bin_edges[1] - bin_edges[0]
# Plot the histogram as a bar plot
plt.bar(bin_middles, bin_probability, width=bin_width)

# Fit to normal distribution
(mu, sigma) = stats.norm.fit(S)
# The pdf should no longer be normed to unit area, but scaled the same way as the data.
# stats.norm.pdf is used because newer Matplotlib releases no longer provide mlab.normpdf.
y = stats.norm.pdf(bin_middles, mu, sigma) * bin_width
l = plt.plot(bin_middles, y, 'r', linewidth=2)

plt.grid(True)
plt.xlim(-0.05, 0.05)
plt.show()

And the resulting picture will be:

[image: histogram of bin probabilities with the rescaled normal fit]

Answered By: jotasi

jotasi’s answer works, of course, but I’d like to add a very simple trick for achieving this by directly calling hist.

The trick is to use the weights parameter. By default, every data point you pass has a weight of 1. The height of each bin is then the sum of the weights of the data points that fall into that bin. Instead, if we have n points, we can simply make the weight of each point be 1 / n. Then, the sum of the weights of the points that fall into a certain bucket is also the probability that a given point is in that bucket.
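To see why the weights sum to a probability, here is a small check using np.histogram, which accepts the same weights argument; as far as I know, plt.hist passes its weights through to it for the binning. The data and names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)              # fixed seed, purely illustrative
data = rng.normal(0.0, 0.01, size=555)      # synthetic stand-in for the returns

# Every point gets weight 1/n, so each bin height is (points in bin) / n.
weights = np.full(data.shape, 1.0 / data.size)
heights, edges = np.histogram(data, bins=100, weights=weights)

# Compare with plain counts divided by n.
counts, _ = np.histogram(data, bins=100)
heights_sum = heights.sum()
matches = np.allclose(heights, counts / data.size)
print(heights_sum, matches)
```

The weighted heights equal the fraction of points in each bin and add up to 1.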

In your case, just change the plot line to:

n, bins, patches = plt.hist(S, weights=np.ones_like(S) / len(S),
                            facecolor='blue', alpha=0.75)

Answered By: Gabriel

Gabriel’s answer initially didn’t work for me, because I was also passing density=True. The interaction is not obvious: as far as I can tell, with density=True the bar heights are rescaled so that the total area is 1, which cancels uniform weights such as 1/n, and matplotlib raises no error or warning about it.
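This interaction can be reproduced without plotting. The sketch below uses np.histogram directly, on the assumption that plt.hist relies on it for binning and normalization; the data and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)              # fixed seed, purely illustrative
data = rng.normal(0.0, 0.01, size=555)
weights = np.full(data.shape, 1.0 / data.size)

# density=True renormalizes to unit area, so uniform weights drop out.
with_weights, edges = np.histogram(data, bins=100, weights=weights, density=True)
without_weights, _ = np.histogram(data, bins=100, density=True)

weights_cancel = np.allclose(with_weights, without_weights)
area = np.sum(with_weights * np.diff(edges))   # area under the bars
heights_sum = with_weights.sum()               # sum of bar heights, not 1
print(weights_cancel, area, heights_sum)
```

With density=True the heights integrate to 1 but do not sum to 1, so the density flag should be dropped when using the weights trick.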

Answered By: Linn Abraham

The matplotlib plt.hist documentation itself gives a hint for a simpler version of this code.

counts, bins = np.histogram(data)
weights = counts/np.sum(counts)
plt.hist(bins[:-1], bins, weights=weights)
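
The reason this works is that each left bin edge falls into its own bin, so passing the precomputed probabilities as weights reproduces them as bar heights. A minimal check with np.histogram, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)              # fixed seed, purely illustrative
data = rng.normal(0.0, 0.01, size=555)

counts, bins = np.histogram(data)
weights = counts / np.sum(counts)           # per-bin probabilities

# One "data point" per bin (its left edge), weighted by that bin's probability.
rebinned, _ = np.histogram(bins[:-1], bins, weights=weights)
reproduced = np.allclose(rebinned, weights)
print(reproduced, rebinned.sum())
```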
Answered By: Linn Abraham