Get a list of lists of elements in each bin of a 2D histogram

Question:

I’m working with 2D data, and I’m aware of how to bin the data to form a 2D histogram using np.histogram2d, and also how to find the bin-location of a particular element using np.digitize.

The code I use to find which histogram bin a particular element is located in looks something like this:

import numpy as np

bins = [[0, 0.3, 0.5, 0.7, 1.1], [0, 0.3, 0.7, 1.1]]
values = np.random.random((10, 2))
digitised = []
for i in range(len(bins)):
    digitised.append(np.digitize(values[:, i], bins[i], right=True))
digitised = np.concatenate(digitised).reshape(2, 10)

where the first row of the ‘digitised’ array corresponds to the x-direction and the second row to the y-direction, i.e. if digitised[0][0] = 4 and digitised[1][0] = 2, then the first element in my ‘values’ array is in the 4th x-bin and 2nd y-bin.

The code I use to compute the overall 2D histogram is:

bins_x = np.array([0, 0.3, 0.5, 0.7, 1.1])
bins_y = np.array([0, 0.3, 0.7, 1.1])
H, edge_x, edge_y = np.histogram2d(values[:, 0], values[:, 1], bins=(bins_x, bins_y))
H = H.T

and the output of the above code block would look something like this:

H:

array([[0., 3., 0., 0.],
       [1., 0., 0., 1.],
       [1., 0., 1., 3.]])

I’m interested in extracting a list of lists of the elements within each bin. For example, for the H[0][1] entry, which contains three values, I would like a list of which elements in values fall into that entry. More generally, I would like such a list for every bin in the 2D histogram.

This would be possible using a double for-loop, e.g. sorting through the x-values of the ‘digitised’ list first, then finding the y-values, and grouping them together. However, to the best of my knowledge, this would require a copious number of if statements to sort through all the individual bins, which would get quite inefficient for a larger dataset (e.g. an 8 x 7 grid compared to the 4 x 3 example here).

I would be super grateful for any advice or suggestions as to how to go about doing this, thank you!

Asked By: Raz

Answers:

If you are not restricted to NumPy, you can use SciPy to calculate both the histogram and the bin numbers for each element of the source 2D array.

import scipy.stats

H, edge_x, edge_y, binnumber = scipy.stats.binned_statistic_2d(
  values[:, 0],
  values[:, 1],
  None,
  bins=(bins_x, bins_y),
  statistic='count',
  expand_binnumbers=True
)
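
To make the layout of binnumber concrete, here is a minimal sketch using a small hand-picked dataset. The points array is an assumption made for illustration and is not from the question. With expand_binnumbers=True, binnumber has one row per dimension and one column per point, and the indices are 1-based (0 and len(edges) are reserved for points outside the bin edges).

```python
import numpy as np
import scipy.stats

bins_x = np.array([0, 0.3, 0.5, 0.7, 1.1])
bins_y = np.array([0, 0.3, 0.7, 1.1])

# Hypothetical sample points, chosen so the bins are easy to verify by hand.
points = np.array([
    [0.10, 0.10],  # x-bin 1, y-bin 1
    [0.40, 0.50],  # x-bin 2, y-bin 2
    [0.90, 0.80],  # x-bin 4, y-bin 3
    [0.95, 0.90],  # x-bin 4, y-bin 3
])

H, edge_x, edge_y, binnumber = scipy.stats.binned_statistic_2d(
    points[:, 0],
    points[:, 1],
    None,
    bins=(bins_x, bins_y),
    statistic='count',
    expand_binnumbers=True,
)

print(binnumber.shape)     # (2, 4): row 0 is x, row 1 is y
print(binnumber.tolist())  # [[1, 2, 4, 4], [1, 2, 3, 3]]
print(H.shape)             # (4, 3): indexed as H[x_bin, y_bin], not transposed
```

Note that H here is indexed as H[x_bin, y_bin], whereas the question transposes its histogram with H = H.T.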

If you would like to group all elements by the bin they fall into, you can use the following snippet:

from collections import defaultdict

bin_values = defaultdict(list)
for value_i, (bin_x, bin_y) in enumerate(binnumber.T):
  bin_values[(bin_x, bin_y)].append(values[value_i])  

So to find which elements are located in the first bin along X and the third bin along Y, you check the corresponding entry of the bin_values dictionary:

> bin_values[(1, 3)]
[array([0.92643067, 0.98808226]), array([0.8453115 , 0.75003263])]
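
If a nested list indexed like the histogram is preferred over a dictionary, the same grouping can be sketched as below. The helper name group_by_bin and the sample arrays are hypothetical and only serve as an illustration. The helper assumes binnumber has the (2, N) layout produced with expand_binnumbers=True, converts the 1-based indices to 0-based ones, and skips points that fall outside the bin edges.

```python
import numpy as np

def group_by_bin(values, binnumber, n_bins_x, n_bins_y):
    """Hypothetical helper: grouped[ix][iy] holds the rows of `values`
    that fall in x-bin ix and y-bin iy (both 0-based)."""
    grouped = [[[] for _ in range(n_bins_y)] for _ in range(n_bins_x)]
    for row, (bin_x, bin_y) in zip(values, binnumber.T):
        # Indices are 1-based; 0 and n_bins + 1 mark out-of-range points.
        if 1 <= bin_x <= n_bins_x and 1 <= bin_y <= n_bins_y:
            grouped[bin_x - 1][bin_y - 1].append(row)
    return grouped

# Hand-written example data in the layout described above.
values = np.array([[0.10, 0.10], [0.40, 0.50], [0.90, 0.80], [0.95, 0.90]])
binnumber = np.array([[1, 2, 4, 4],
                      [1, 2, 3, 3]])

grouped = group_by_bin(values, binnumber, n_bins_x=4, n_bins_y=3)
print(len(grouped[3][2]))  # 2 points in the 4th x-bin and 3rd y-bin
print(grouped[2][0])       # [] for an empty bin
```

Here grouped[ix][iy] lines up with the untransposed histogram H[ix, iy]. With the transposed H from the question, the matching entry is H[iy][ix].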

Please check the documentation for more info: scipy.stats.binned_statistic_2d

EDIT:

If you do print(bin_values[(2, 2)]) (given that there is no entry for (2, 2)), you will get []. This list is generated automatically and placed into bin_values as soon as you look up a nonexistent key in the dictionary. If you really need to see empty lists in the print(bin_values) output immediately, you can set them up like this:

import itertools

for bin_index_2d in itertools.product(range(len(bins[0])), range(len(bins[1]))):
  if bin_index_2d not in bin_values:
    bin_values[bin_index_2d] = []
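
The defaultdict behaviour described in this edit can be checked with a short standalone sketch. The keys and the placeholder string below are made up for illustration. Indexing a missing key inserts an empty list, while a membership test or .get() leaves the dictionary unchanged.

```python
from collections import defaultdict

bin_values = defaultdict(list)
bin_values[(1, 3)].append("a point")  # placeholder value for illustration

print((2, 2) in bin_values)    # False: a membership test does not create the key
empty = bin_values[(2, 2)]     # indexing a missing key inserts [] under that key
print(empty)                   # []
print((2, 2) in bin_values)    # True: the key now exists
print(bin_values.get((4, 1)))  # None: .get() does not insert anything
print((4, 1) in bin_values)    # False
```

Using .get() or an in check is therefore a way to inspect a bin without adding an empty entry for it.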

Answered By: Kuroneko