Matplotlib histogram with collection bin for high values

Question:

I have an array of values and want to plot a histogram of it. I am mainly interested in the low-end values and want to collect every value above 300 in a single bin. This bin should have the same width as all the other (equally wide) bins. How can I do this?

Note: this question is related to this question: Defining bin width/x-axis scale in Matplotlib histogram

This is what I tried so far:

import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_B = np.random.choice(np.arange(600), size=200, replace=True).tolist()

    bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 600]

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([values_A, values_B], density=True,  # density replaces the removed normed argument
                                bins=bins,
                                color=['#3782CC', '#AFD5FA'],
                                label=['A', 'B'])

    xlabels = [str(int(edge)) for edge in bins[1:]]  # plain str labels; dtype='|S4' yields bytes on Python 3
    xlabels[-1] = '300+'

    N_labels = len(xlabels)
    plt.xlim([0, 600])
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend()

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

This is the result, which does not look nice:
[image: resulting histogram]

I then changed the line with xlim in it:

plt.xlim([0, 325])

With the following result:
[image: resulting histogram]

It looks more or less as I want it, but the last bin is not visible now. Which trick am I missing to visualize this last bin with a width of 25?

Asked By: physicalattraction


Answers:

Sorry, I am not familiar with matplotlib, so I have a dirty hack for you: I replaced every value greater than 300 with 301, so they all fall into one bin, and changed the last bin edge from 600 to 325.

The root of the problem is that matplotlib draws every bin at its true width on the x-axis. In R I would convert the bins to a factor variable, so that they are not treated as real numbers.

import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_B = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_A_to_plot = [301 if i > 300 else i for i in values_A]
    values_B_to_plot = [301 if i > 300 else i for i in values_B]

    bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325]

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([values_A_to_plot, values_B_to_plot], density=True,  # density replaces the removed normed argument
                                bins=bins,
                                color=['#3782CC', '#AFD5FA'],
                                label=['A', 'B'])

    xlabels = [str(int(edge)) for edge in bins[1:]]  # plain str labels; dtype='|S4' yields bytes on Python 3
    xlabels[-1] = '300+'

    N_labels = len(xlabels)

    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend()

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

plot_histogram_01()

[image: resulting histogram]
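
The factor idea mentioned above has a rough matplotlib counterpart: count the values per bin yourself and draw the counts as bars at integer positions, so the axis never sees the real bin edges. The following is only a sketch under that assumption. The helper name binned_counts, the output file name, and the use of np.digitize are illustrative choices, not part of the answer's code, and the helper assumes no value lies below the first edge.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def binned_counts(values, edges):
    """Count values per bin; everything at or above edges[-1] goes into a final overflow bin.

    Assumes every value is >= edges[0] (hypothetical helper, for illustration only).
    """
    # np.digitize returns i with edges[i-1] <= v < edges[i]; values >= edges[-1] get len(edges)
    idx = np.digitize(values, edges) - 1
    return np.bincount(idx, minlength=len(edges))

rng = np.random.default_rng(1)                # illustrative data, not the OP's exact sample
values = rng.integers(0, 600, size=200)

edges = np.arange(0, 325, 25)                 # 0, 25, ..., 300
counts = binned_counts(values, edges)         # 12 regular bins + 1 overflow bin
labels = [f"{lo}-{lo + 25}" for lo in edges[:-1]] + ["300+"]

fig, ax = plt.subplots(figsize=(9, 5))
positions = np.arange(len(counts))            # categories, not real numbers
ax.bar(positions, counts, width=1.0, color="#3782CC")
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45)
fig.tight_layout()
fig.savefig("my_plot_categorical.png")
plt.close(fig)
```

Because the bars sit at positions 0, 1, 2, ..., the overflow bar is automatically as wide as the others, whatever range of values it represents.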

Answered By: Artem Fedosov

NumPy has a handy function for dealing with this: np.clip. Despite what the name may suggest, it doesn’t remove values; it limits them to the range you specify, so every value above the upper limit is set to that limit. In effect, it does Artem’s “dirty hack” inline. You can leave the values as they are and simply wrap the array in an np.clip call inside the hist call, like so:

plt.hist(np.clip(values_A, bins[0], bins[-1]), bins=bins)

This is nicer for a number of reasons:

  1. It’s much faster, at least for large numbers of elements. NumPy does its work at the C level, whereas operating on Python lists (as in Artem’s list comprehension) carries a lot of overhead per element. As a rule, if you have the option to use NumPy, you should.

  2. You do it right where it’s needed, which reduces the chance of making mistakes in your code.

  3. You don’t need to keep a second copy of the array hanging around, which reduces memory usage (except within this one line) and further reduces the chances of making mistakes.

  4. Using bins[0], bins[-1] instead of hard-coding the values again reduces the chance of mistakes, because you only change the bins where bins is defined; you don’t need to remember to change them in the call to clip or anywhere else.
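
As a small check of what np.clip does here, the sketch below clips random data to the bin range and counts it with np.histogram (used instead of plt.hist only so that the counts can be inspected; the variable names and the random sample are illustrative). It relies on the documented behaviour that the last bin of np.histogram includes its right edge, so values clipped to bins[-1] are counted in the final bin.

```python
import numpy as np

rng = np.random.default_rng(1)            # illustrative data, not the OP's exact sample
values = rng.integers(0, 600, size=200)   # integers in [0, 600)

bins = np.arange(0, 350, 25)              # edges 0, 25, ..., 325

# Everything above bins[-1] becomes bins[-1]; the original array is untouched
clipped = np.clip(values, bins[0], bins[-1])
counts, _ = np.histogram(clipped, bins=bins)

# The list-comprehension version from the earlier answer, for comparison
hacked = [301 if v > 300 else v for v in values.tolist()]
counts_hack, _ = np.histogram(hacked, bins=bins)

print(counts)
print(counts[-1], np.count_nonzero(values >= 300))
```

Both approaches put every value of 300 or more into the last bin, so the two count arrays should agree; the clipped version just avoids the extra Python-level loop.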

So to put it all together as in the OP:

import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True)
    values_B = np.random.choice(np.arange(600), size=200, replace=True)

    bins = np.arange(0, 350, 25)

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([np.clip(values_A, bins[0], bins[-1]),
                                 np.clip(values_B, bins[0], bins[-1])],
                                density=True,  # density replaces the removed normed argument
                                bins=bins, color=['#3782CC', '#AFD5FA'], label=['A', 'B'])

    xlabels = bins[1:].astype(str)
    xlabels[-1] = f'{bins[-2]}+'  # label the overflow bin by its lower edge: '300+'

    N_labels = len(xlabels)
    plt.xlim([0, 325])
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend(loc='upper left')

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

plot_histogram_01()

[image: result of code above]
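
In the same spirit as point 4, the tick positions and labels can also be derived from bins instead of hard-coding 25 and 12.5. This is a minimal sketch, assuming equal-width bins whose last bin is the overflow bin; overflow_ticks is a hypothetical helper name, not a matplotlib or NumPy function.

```python
import numpy as np

def overflow_ticks(bins):
    """Return tick positions (bin centres) and labels for bins whose
    last bin collects everything at or above its lower edge."""
    bins = np.asarray(bins)
    centers = (bins[:-1] + bins[1:]) / 2        # one tick per bin, at its centre
    labels = [str(edge) for edge in bins[1:]]   # label each bin by its upper edge
    labels[-1] = f"{bins[-2]}+"                 # overflow bin: lower edge plus '+'
    return centers, labels

centers, labels = overflow_ticks(np.arange(0, 350, 25))
print(centers[:3], labels[-3:])
```

The result could then be passed on as plt.xticks(centers, labels), replacing the separate xticks and set_xticklabels calls.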

Answered By: Mike