Group and average NumPy matrix

Question

Say I have an arbitrary numpy matrix that looks like this:

arr = [[  6.0   12.0   1.0]
       [  7.0   9.0   1.0]
       [  8.0   7.0   1.0]
       [  4.0   3.0   2.0]
       [  6.0   1.0   2.0]
       [  2.0   5.0   2.0]
       [  9.0   4.0   3.0]
       [  2.0   1.0   4.0]
       [  8.0   4.0   4.0]
       [  3.0   5.0   4.0]]

What would be an efficient way of averaging rows that are grouped by their third column number?

The expected output would be:

result = [[  7.0  9.33  1.0]
          [  4.0  3.0  2.0]
          [  9.0  4.0  3.0]
          [  4.33  3.33  4.0]]

Asked By: Algorithm

||

Source

Answer 1

solution

from itertools import groupby
from operator import itemgetter

arr = [[6.0, 12.0, 1.0],
       [7.0, 9.0, 1.0],
       [8.0, 7.0, 1.0],
       [4.0, 3.0, 2.0],
       [6.0, 1.0, 2.0],
       [2.0, 5.0, 2.0],
       [9.0, 4.0, 3.0],
       [2.0, 1.0, 4.0],
       [8.0, 4.0, 4.0],
       [3.0, 5.0, 4.0]]

result = []

for groupByID, rows in groupby(arr, key=itemgetter(2)):
    position1, position2, counter = 0, 0, 0
    for row in rows:
        position1+=row[0]
        position2+=row[1]
        counter+=1
    result.append([position1/counter, position2/counter, groupByID])

print(result)

would output:

[[7.0, 9.333333333333334, 1.0]]
[[4.0, 3.0, 2.0]]
[[9.0, 4.0, 3.0]]
[[4.333333333333333, 3.3333333333333335, 4.0]]

Answered By: DmitrySemenov

Answer 2

You can do:

for x in sorted(np.unique(arr[...,2])):
    results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]), 
                    np.average(arr[np.where(arr[...,2]==x)][...,1]),
                    x])

Testing:

>>> arr
array([[  6.,  12.,   1.],
       [  7.,   9.,   1.],
       [  8.,   7.,   1.],
       [  4.,   3.,   2.],
       [  6.,   1.,   2.],
       [  2.,   5.,   2.],
       [  9.,   4.,   3.],
       [  2.,   1.,   4.],
       [  8.,   4.,   4.],
       [  3.,   5.,   4.]])
>>> results=[]
>>> for x in sorted(np.unique(arr[...,2])):
...     results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]), 
...                     np.average(arr[np.where(arr[...,2]==x)][...,1]),
...                     x])
... 
>>> results
[[7.0, 9.3333333333333339, 1.0], [4.0, 3.0, 2.0], [9.0, 4.0, 3.0], [4.333333333333333, 3.3333333333333335, 4.0]]

The array arr does not need to be sorted, and all the intermediate arrays are views (ie, not new arrays of data). The average is calculated efficiently directly from those views.

Or, for a pure numpy solution:

groups = arr[:,2].copy()

_ndx = np.argsort(groups)
_id, _pos, grp_count  = np.unique(groups[_ndx], 
                return_index=True, 
                return_counts=True)

grp_sum = np.add.reduceat(arr[_ndx], _pos, axis=0)
grp_mean = grp_sum / grp_count[:,None]  

>>> grp_mean
array([[7.        , 9.33333333, 1.        ],
       [4.        , 3.        , 2.        ],
       [9.        , 4.        , 3.        ],
       [4.33333333, 3.33333333, 4.        ]])

Answered By: dawg

Answer 3

arr = np.array(
[[  6.0,   12.0,   1.0],
 [  7.0,   9.0,   1.0],
 [  8.0,   7.0,   1.0],
 [  4.0,   3.0,   2.0],
 [  6.0,   1.0,   2.0],
 [  2.0,   5.0,   2.0],
 [  9.0,   4.0,   3.0],
 [  2.0,   1.0,   4.0],
 [  8.0,   4.0,   4.0],
 [  3.0,   5.0,   4.0]])
np.array([a.mean(0) for a in np.split(arr, np.argwhere(np.diff(arr[:, 2])) + 1)])

Answered By: HYRY

Answer 4

A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a fully vectorized solution:

import numpy_indexed as npi
npi.group_by(arr[:, 2]).mean(arr)

Answered By: Eelco Hoogendoorn

Group and average NumPy matrix

Question:

Answers: