Average function excluding value for each row in Pandas DataFrame

Question:

Is there a simple way to calculate the average of each column in a pandas DataFrame while, for each row, excluding that row's own value? The x in each row below marks the value that is excluded from the calculation in that iteration:

    a    b                     a    b                    a    b
0   1    2                 0   x    x                0   1    2
1   2    4    first loop   1   2    4   second loop  1   x    x   etc.
2   3    6       --->      2   3    6     --->       2   3    6   --->
3   4    8                 3   4    8                3   4    8
4   5   10                 4   5   10                4   5   10
                           ____________              _____________
                   col_avg:  3.5  7.0        col_avg: 3.25  6.5

Only 4 values are used at each iteration, because the "x" is excluded from the data set.

The result should be a new DataFrame:

    a_x    b_x
0   3.5    7.0
1   3.25   6.5
2   3.0    6.0
3   2.75   5.5
4   2.5    5.0

Thanks

/N

Asked By: gussilago

||

Answers:

As a first step, suppose we were interested in the sum rather than the average. In that case, we would add all elements along each column except the current element. Another way to look at it is to sum all elements along each column and then subtract the current element. So we can get the sums of all columns with df.sum(0) and simply subtract df from them. pandas aligns the sums on the column labels and broadcasts the subtraction across all rows, so every column is handled in one go.

For the second step, the averaging, we simply divide by the number of elements that went into each column's sum, i.e. df.shape[0]-1.

Thus, we would have a vectorized solution, like so –

df_out = (df.sum(0) - df)/float(df.shape[0]-1)

Sample run –

In [128]: df
Out[128]: 
   a   b
0  1   2
1  2   4
2  3   6
3  4   8
4  5  10

In [129]: (df.sum(0) - df)/float(df.shape[0]-1)
Out[129]: 
      a    b
0  3.50  7.0
1  3.25  6.5
2  3.00  6.0
3  2.75  5.5
4  2.50  5.0

To set the column names to the desired ones, use: df_out.columns = ['a_x','b_x'].
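
As a sanity check, the formula can be compared against a brute-force loop that drops one row at a time. The following is a minimal sketch, assuming the sample data from the question, no missing values, and more than one row; the helper name leave_one_out_mean is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def leave_one_out_mean(df):
    """Column means with each row's own value excluded (hypothetical helper)."""
    n = len(df)  # assumes n > 1 and no NaN values
    # Column totals minus the current element, divided by the remaining count.
    return (df.sum(axis=0) - df) / (n - 1)


df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10]})
fast = leave_one_out_mean(df).add_suffix("_x")

# Brute force: drop row r, take the column means, repeat for every row.
slow = pd.DataFrame({r: df.drop(r).mean() for r in df.index}).T.add_suffix("_x")

print(fast)
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```

The vectorized version touches each value a constant number of times, while the brute-force loop recomputes a full column mean for every row, so the difference in cost grows with the number of rows.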

Answered By: Divakar

I ran into a similar problem, but needed both the mean and the standard deviation, excluding the current row.

The standard deviation was quite a bit harder to calculate, because it needs all of the remaining values as well as their mean.

The following can easily be extended to pretty much any of the aggregating functions from numpy.

In [265]: import numpy as np
     ...: import pandas as pd

In [266]: df = pd.DataFrame({"a": np.arange(5) + 1, "b": 2 * (np.arange(5) + 1)})

In [267]: df
Out[267]:
   a   b
0  1   2
1  2   4
2  3   6
3  4   8
4  5  10

In [268]: import numpy.ma as ma
     ...: import numpy as np

Create a 3-dimensional numpy array by stacking the DataFrame’s values once per row, so that there is one full copy of the data for each row to be excluded:

In [269]: t = np.stack(tuple(df.values for _ in range(len(df.index))), axis=0)

In [270]: t
Out[270]:
array([[[ 1,  2],
        [ 2,  4],
        [ 3,  6],
        [ 4,  8],
        [ 5, 10]],

       [[ 1,  2],
        [ 2,  4],
        [ 3,  6],
        [ 4,  8],
        [ 5, 10]],

       [[ 1,  2],
        [ 2,  4],
        [ 3,  6],
        [ 4,  8],
        [ 5, 10]],

       [[ 1,  2],
        [ 2,  4],
        [ 3,  6],
        [ 4,  8],
        [ 5, 10]],

       [[ 1,  2],
        [ 2,  4],
        [ 3,  6],
        [ 4,  8],
        [ 5, 10]]])

Create a stack of identity matrices, one per column, to use as a mask in the aggregating function. In the i-th copy of the data, the 1s mark row i as the row to exclude:

In [271]: e = np.stack(tuple(np.eye(len(df.index)) for _ in range(len(df.columns))), axis=2)

In [272]: e
Out[272]:
array([[[1., 1.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [1., 1.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [1., 1.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [1., 1.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [1., 1.]]])
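
The two stacking steps above can also be written with np.repeat and a boolean identity matrix. This is only an alternative sketch under the same sample data, not the author's code; the names n_rows, n_cols and values are introduced here for illustration.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

df = pd.DataFrame({"a": np.arange(5) + 1, "b": 2 * (np.arange(5) + 1)})
n_rows, n_cols = df.shape
values = df.to_numpy()

# One copy of the data per row to exclude: shape (n_rows, n_rows, n_cols).
t = np.repeat(values[None, :, :], n_rows, axis=0)

# Boolean identity matrix repeated across the columns: e[i, i, :] is True.
e = np.repeat(np.eye(n_rows, dtype=bool)[:, :, None], n_cols, axis=2)

masked_array = ma.array(t, mask=e)
print(masked_array.shape)
```

Both arrays hold n_rows * n_rows * n_cols entries, so memory use grows quadratically with the number of rows, whichever way they are built.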

Construct a masked array (numpy.ma.array) from the stacked data and identities

In [275]: masked_array = ma.array(t, mask=e)

In [276]: masked_array
Out[276]:
masked_array(
  data=[[[--, --],
         [2, 4],
         [3, 6],
         [4, 8],
         [5, 10]],

        [[1, 2],
         [--, --],
         [3, 6],
         [4, 8],
         [5, 10]],

        [[1, 2],
         [2, 4],
         [--, --],
         [4, 8],
         [5, 10]],

        [[1, 2],
         [2, 4],
         [3, 6],
         [--, --],
         [5, 10]],

        [[1, 2],
         [2, 4],
         [3, 6],
         [4, 8],
         [--, --]]],
  mask=[[[ True,  True],
         [False, False],
         [False, False],
         [False, False],
         [False, False]],

        [[False, False],
         [ True,  True],
         [False, False],
         [False, False],
         [False, False]],

        [[False, False],
         [False, False],
         [ True,  True],
         [False, False],
         [False, False]],

        [[False, False],
         [False, False],
         [False, False],
         [ True,  True],
         [False, False]],

        [[False, False],
         [False, False],
         [False, False],
         [False, False],
         [ True,  True]]],
  fill_value=999999)

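And finally calculate your aggregate values along axis 1. Note that np.nanstd defaults to ddof=0 (population standard deviation), whereas the pandas std method defaults to ddof=1: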

In [277]: np.nanmean(masked_array, axis=1).data
Out[277]:
array([[3.5 , 7.  ],
       [3.25, 6.5 ],
       [3.  , 6.  ],
       [2.75, 5.5 ],
       [2.5 , 5.  ]])

In [278]: np.nanstd(masked_array, axis=1).data
Out[278]:
array([[1.11803399, 2.23606798],
       [1.47901995, 2.95803989],
       [1.58113883, 3.16227766],
       [1.47901995, 2.95803989],
       [1.11803399, 2.23606798]])
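
The masked array's own mean and std methods also skip masked entries, and std accepts a ddof argument. The sketch below assumes the same data and mask as above and collects the leave-one-out mean and the sample standard deviation (ddof=1) in one DataFrame; the column names a_mean, b_mean, a_std and b_std are chosen here for illustration.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

df = pd.DataFrame({"a": np.arange(5) + 1, "b": 2 * (np.arange(5) + 1)})
n = len(df)

# Same stacked data and identity mask as in the steps above.
t = np.stack([df.to_numpy()] * n, axis=0)
e = np.stack([np.eye(n)] * df.shape[1], axis=2)
masked = ma.array(t, mask=e)

# Masked-array reductions ignore the masked (current) row.
loo_mean = masked.mean(axis=1)
loo_std_pop = masked.std(axis=1)             # ddof=0, population std
loo_std_sample = masked.std(axis=1, ddof=1)  # ddof=1, like pandas' default

result = pd.DataFrame(
    np.hstack([loo_mean.filled(np.nan), loo_std_sample.filled(np.nan)]),
    index=df.index,
    columns=["a_mean", "b_mean", "a_std", "b_std"],
)
print(result)
```

Which ddof is appropriate depends on whether the remaining rows are treated as a full population or as a sample.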

Answered By: Jim

Here is a way using pd.concat() and drop()

pd.concat([df.drop(r).mean() for r in df.index],keys=df.index).unstack()

or

pd.concat([df.drop(r).mean() for r in df.index],axis=1).T

or, using np.roll (with numpy imported as np)

df.apply(lambda x: [np.roll(x,-i)[1:].mean() for i in range(df.shape[0])])

Output:

      a    b
0  3.50  7.0
1  3.25  6.5
2  3.00  6.0
3  2.75  5.5
4  2.50  5.0
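
The drop-based variants extend to other aggregations simply by swapping the method. Here is a minimal sketch, assuming the sample data from the question; leave_one_out is a hypothetical helper name, and func is anything that DataFrame.agg accepts.

```python
import pandas as pd


def leave_one_out(df, func="mean"):
    """Aggregate every column with one row dropped at a time (hypothetical helper)."""
    # One aggregated Series per dropped row, placed side by side and transposed
    # so that the result has the same index as df.
    return pd.concat(
        [df.drop(r).agg(func) for r in df.index], axis=1, keys=df.index
    ).T


df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10]})
print(leave_one_out(df, "median"))
```

Because every row triggers a fresh aggregation over the remaining rows, the work grows quadratically with the number of rows, so this is best suited to small frames or to aggregations that have no closed-form shortcut.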

Answered By: rhug123