Covariance between two columns in pandas groupby
Question:
I am trying to calculate the covariance between two columns by group. I am doing the following:
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
B['value1'].cov(B['value2'])
Ideally, I would like to get the covariance between X and Y and not the whole variance-covariance matrix, since I only have two columns.
Thank you,
Answers:
The following code gives you the grouped variance-covariance matrix. You can subset it as you wish to just get the covariances.
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
print(A.groupby('group').cov())
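One possible way to do the subsetting mentioned above is sketched here. The explicit column selection and the use of xs() are choices made for this illustration, not part of the original answer; the variable names cov_matrix and cov_v1_v2 are hypothetical.

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})

# Per-group covariance matrix; the result is indexed by (group, column name).
cov_matrix = A.groupby('group')[['value1', 'value2']].cov()

# Keep the rows labelled 'value1' on the inner index level,
# then read the 'value2' column: one covariance per group.
cov_v1_v2 = cov_matrix.xs('value1', level=1)['value2']
print(cov_v1_v2)
```

The result is a Series indexed by group, which is usually easier to merge back onto other per-group statistics than the full matrix.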
You are almost there; you just don't fully understand the groupby object yet. See Pandas-GroupBy for more details.
For your problem, if I understand correctly, you would like to calculate the covariance between two columns within each group.
The simplest way is to use the groupby.cov function, which gives the pairwise covariance between columns within each group.
A.groupby('group').cov()
                value1    value2
group
A     value1  1.666667 -2.666667
      value2 -2.666667  4.666667
B     value1  1.000000  0.500000
      value2  0.500000  0.333333
If you only need cov(grouped_v1, grouped_v2):
grouped = A.groupby('group')
grouped.apply(lambda x: x['value1'].cov(x['value2']))
group
A -2.666667
B 0.500000
Here, grouped is a groupby object. The grouped.apply function takes a callback function as its argument, and each group is passed to that callback in turn. In this case, the callback is a lambda function, and its argument x is a group (a DataFrame).
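To make the callback mechanics concrete, here is a rough sketch of what apply does in this case, written as an explicit loop. It is only an approximation of the behaviour for this particular example, not how pandas implements apply internally; the names manual, key and sub_df are hypothetical.

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})
grouped = A.groupby('group')

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
# Each sub-DataFrame plays the role of x in the lambda.
manual = {}
for key, sub_df in grouped:
    manual[key] = sub_df['value1'].cov(sub_df['value2'])

# Collect the per-group scalars into a Series indexed by group key.
result = pd.Series(manual)
print(result)
```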
Hope this will be helpful for your understanding of groupby.
If you’re looking for the cov() of two specific columns, you can use df.Age.cov(df.Salary), assuming that Age and Salary are two of the many columns of the DataFrame. This is useful when you need only two columns.
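As a small illustration using the question's own data rather than the Age and Salary columns (which are not shown), the same Series.cov call looks like this. Note that, used this way, it computes a single covariance over all rows and ignores the groups; the name overall is hypothetical.

```python
import pandas as pd

A = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'value1': [1, 2, 3, 4, 5, 6, 7],
                  'value2': [8, 5, 4, 3, 7, 8, 8]})

# Series.cov returns one scalar: the sample covariance over all seven rows.
overall = A['value1'].cov(A['value2'])
print(overall)
```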
Here is an alternative solution that estimates cov(value1, value2) within each group but doesn’t use .apply():
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
cov_a_b = B[['value1', 'value2']].cov(ddof=0)['value1'].unstack()['value2']
As an additional note somewhat related to the question, you should be careful with the default degrees-of-freedom correction when using the NumPy/pandas implementations of covariance: both use a correction of 1 by default. The variance defaults are not consistent across the two libraries: NumPy's np.var applies no correction by default (ddof=0), whereas pandas' .var() defaults to ddof=1. This is why I included ddof=0 explicitly.
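The defaults are easy to check directly. The sketch below uses the group A values from the question and compares the default variance and covariance in NumPy and pandas; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])  # value1 for group A
y = np.array([8.0, 5.0, 4.0, 3.0])  # value2 for group A

np_var = np.var(x)                        # NumPy variance, default ddof=0
pd_var = pd.Series(x).var()               # pandas variance, default ddof=1
np_cov = np.cov(x, y)[0, 1]               # NumPy covariance, default divides by n - 1
pd_cov = pd.Series(x).cov(pd.Series(y))   # pandas covariance, default ddof=1
pop_cov = np.cov(x, y, ddof=0)[0, 1]      # population covariance, divides by n

print(np_var, pd_var, np_cov, pd_cov, pop_cov)
```

Passing ddof explicitly, as in the solution above, avoids depending on whichever default a given function happens to use.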