Covariance between two columns in pandas groupby

Question:

I am trying to calculate the covariance between two columns by group. I am doing the following:

A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
                  'value1':[1,2,3,4,5,6,7],
                  'value2':[8,5,4,3,7,8,8]})

B = A.groupby('group')

B['value1'].cov(B['value2'])

Ideally, I would like to get only the covariance between value1 and value2 within each group, not the whole variance-covariance matrix, since I only have two columns.

Thank you,

Asked By: dleal


Answers:

The following code gives you the grouped variance-covariance matrix. You can subset it as you wish to just get the covariances.

import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
                  'value1':[1,2,3,4,5,6,7],
                  'value2':[8,5,4,3,7,8,8]})
print(A.groupby('group').cov())
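
One possible way to do the subsetting mentioned above is sketched here. It assumes the same frame A as in the question; the names cov_matrix and cov_v1_v2 are illustrative only. The grouped result is indexed by (group, column name), so selecting the value1 rows and then the value2 column leaves one covariance per group.

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})

# Full per-group covariance matrix, indexed by (group, column name).
cov_matrix = A.groupby('group')[['value1', 'value2']].cov()

# Keep the rows labelled 'value1' on the second index level, then take
# the 'value2' column: this is cov(value1, value2) for each group.
cov_v1_v2 = cov_matrix.xs('value1', level=1)['value2']
print(cov_v1_v2)
```
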
Answered By: scomes

You are almost there; you just do not yet clearly understand the groupby object. See Pandas-GroupBy for more details.

For your problem, if I understand correctly, you would like to calculate the covariance between two columns within the same group.

The simplest approach is to use the groupby.cov function, which gives the pairwise covariance between columns within each group.

A.groupby('group').cov()

                value1    value2
group                           
A     value1  1.666667 -2.666667
      value2 -2.666667  4.666667
B     value1  1.000000  0.500000
      value2  0.500000  0.333333

If you only need cov(value1, value2) within each group:

grouped = A.groupby('group')
grouped.apply(lambda x: x['value1'].cov(x['value2']))

group
A   -2.666667
B    0.500000

Here, grouped is a groupby object. The grouped.apply function takes a callback function as its argument and calls it once per group, passing each group in as the argument. In this case the callback is a lambda function, and its argument x is a single group (a DataFrame).
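
The same idea can be spelled out with a named callback, which may make the per-group call easier to see. This is only a sketch: the function name group_cov is made up for illustration, and the two value columns are selected before apply so that each group passed to the callback holds only those columns (recent pandas versions warn when the grouping column is included).

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})


def group_cov(group_df):
    """Covariance of value1 and value2 within one group's sub-DataFrame."""
    return group_df['value1'].cov(group_df['value2'])


# apply calls group_cov once per group and collects the scalar results
# into a Series indexed by the group label.
result = A.groupby('group')[['value1', 'value2']].apply(group_cov)
print(result)
```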

Hope this will be helpful for your understanding of groupby.

Answered By: rojeeer

If you’re looking for the cov() of two specific columns, you can use df.Age.cov(df.Salary), assuming that Age and Salary are two of the many columns of the DataFrame. This is useful when you need only two columns.
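
Applied to the frame from the question, a plain Series.cov call looks like the sketch below. Note that it pools all rows and does not split by group, so it returns one number for the whole frame rather than one per group. The variable name overall is illustrative.

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})

# Covariance of the two columns over all seven rows, ignoring 'group'.
overall = A['value1'].cov(A['value2'])
print(overall)  # 1.5
```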

Answered By: Tek Acharya

Here is an alternative solution that estimates cov(value1, value2) within each group, but doesn’t use .apply():

import pandas as pd

A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
                  'value1':[1,2,3,4,5,6,7],
                  'value2':[8,5,4,3,7,8,8]})

B = A.groupby('group')

cov_a_b = B[['value1', 'value2']].cov(ddof=0)['value1'].unstack()['value2']

As an additional note somewhat related to the question, you should be careful when using the NumPy/Pandas implementations of covariance, as they use a degrees of freedom correction of 1 by default (they divide by n - 1). NumPy's variance, np.var, applies no correction by default (ddof=0), whereas pandas' var also defaults to ddof=1, so the defaults are not consistent across functions. This is why I included ddof=0, which gives the population covariance.
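
A small check of the defaults can make the difference concrete. This is a sketch using the group A values from the question; the variable names are illustrative, and the ddof argument of Series.cov assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4])
y = pd.Series([8, 5, 4, 3])

# pandas: var and cov both default to ddof=1 (divide by n - 1).
pd_var = x.var()                     # 5 / 3
pd_cov = x.cov(y)                    # -8 / 3

# NumPy: var defaults to ddof=0, while cov divides by n - 1 by default.
np_var = np.var(x.to_numpy())        # 5 / 4
np_cov = np.cov(x, y)[0, 1]          # -8 / 3

# Population covariance (divide by n) in both libraries.
pd_cov_pop = x.cov(y, ddof=0)        # -8 / 4
np_cov_pop = np.cov(x, y, ddof=0)[0, 1]

print(pd_var, np_var, pd_cov, np_cov, pd_cov_pop, np_cov_pop)
```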

Answered By: Adam Oppenheimer