Difference in cov and cor between R and Python
Question:
I often use R and am new to Python.
In R, a demo of computing the mean, cov and cor of a given matrix
is as follows:
X = matrix(c(1,0.5,3,7,9,6,2,8,4), nrow=3, ncol=3, byrow=FALSE)
X
# [,1] [,2] [,3]
# [1,] 1.0 7 2
# [2,] 0.5 9 8
# [3,] 3.0 6 4
M = colMeans(X) # apply(X,2,mean)
M
# [1] 1.500000 7.333333 4.666667
S = cov(X)
S
# [,1] [,2] [,3]
# [1,] 1.75 -1.750000 -1.500000
# [2,] -1.75 2.333333 3.666667
# [3,] -1.50 3.666667 9.333333
R = cor(X)
R
# [,1] [,2] [,3]
# [1,] 1.0000000 -0.8660254 -0.3711537
# [2,] -0.8660254 1.0000000 0.7857143
# [3,] -0.3711537 0.7857143 1.0000000
I want to reproduce the above in Python and I try:
import numpy as np
X = np.array([1,0.5,3,7,9,6,2,8,4]).reshape(3, 3)
X = np.transpose(X) # byrow=FALSE
X
# array([[ 1. , 7. , 2. ],
# [ 0.5, 9. , 8. ],
# [ 3. , 6. , 4. ]])
M = X.mean(axis=0) # colMeans
M
# array([ 1.5 , 7.33333333, 4.66666667])
S = np.cov(X)
S
# array([[ 10.33333333, 10.58333333, 4.83333333],
# [ 10.58333333, 21.58333333, 5.83333333],
# [ 4.83333333, 5.83333333, 2.33333333]])
R = np.corrcoef(X)
R
# array([[ 1. , 0.70866828, 0.98432414],
# [ 0.70866828, 1. , 0.82199494],
# [ 0.98432414, 0.82199494, 1. ]])
The results of cov and cor are different. Why?
Answers:
This is because np.cov treats each row as a variable by default, while R's cov treats each column as a variable. Either remove the line X = np.transpose(X), or use np.cov(X, rowvar=False):
np.cov(X, rowvar=False)
array([[ 1.75 , -1.75 , -1.5 ],
[-1.75 , 2.33333333, 3.66666667],
[-1.5 , 3.66666667, 9.33333333]])
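The same flag works for the correlation. A minimal sketch, rebuilding the question's matrix, showing that np.corrcoef with rowvar=False reproduces R's cor(X):

```python
import numpy as np

# Same matrix as in the question, columns as variables (R's layout).
X = np.array([1, 0.5, 3, 7, 9, 6, 2, 8, 4]).reshape(3, 3).T

# rowvar=False treats each column as a variable, matching R's cor(X).
R = np.corrcoef(X, rowvar=False)
print(R)
# [[ 1.         -0.8660254  -0.37115374]
#  [-0.8660254   1.          0.78571429]
#  [-0.37115374  0.78571429  1.        ]]
```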
The difference is explained in the respective documentation (emphasis mine):
Python:
help(np.cov)
rowvar : bool, optional
If rowvar
is True (default), then each row represents a
variable, with observations in the columns. Otherwise, the relationship
is transposed: each column represents a variable, while the rows
contain observations.
R:
?cov
var, cov and cor compute the variance of x and the covariance or
correlation of x and y if these are vectors. If x and y are matrices
then the covariances (or correlations) between the columns of x and
the columns of y are computed.
If I don't transpose the array in Python, I get exactly the same answer as R.
The covariance is computed by row (X[0] returns the first row), and I suspect this is because R stores matrices in Fortran (column-major) order, whereas Python/NumPy defaults to C (row-major) order. The same applies to how the mean is computed: axis 0 runs down the rows in Python, not the columns.
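To see the ordering point directly: NumPy's reshape accepts order='F', which fills the array column-by-column just like R's matrix(..., byrow=FALSE), so no explicit transpose is needed. A small sketch:

```python
import numpy as np

# order='F' fills column-by-column, like R's matrix(..., byrow=FALSE).
X = np.array([1, 0.5, 3, 7, 9, 6, 2, 8, 4]).reshape(3, 3, order='F')
print(X)
# [[1.  7.  2. ]
#  [0.5 9.  8. ]
#  [3.  6.  4. ]]
```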
You have to pass the transpose of your data matrix to numpy.cov(), because numpy.cov() expects variables in the rows and observations in the columns, as described in the np.cov() documentation:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.cov.html
In the code above, passing the transposed matrix to np.cov() gives the same values as R's cov().
Watch the normalization too: by default both np.cov and R's cov divide by n-1 (the unbiased estimator). In NumPy you can switch to dividing by n with the flag bias=True; R's cov has no such option, so there you would have to scale the result by (n-1)/n yourself.
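A short sketch of the two normalizations in NumPy, checking that they differ exactly by the factor (n-1)/n:

```python
import numpy as np

X = np.array([1, 0.5, 3, 7, 9, 6, 2, 8, 4]).reshape(3, 3).T
n = X.shape[0]  # number of observations (rows)

S_unbiased = np.cov(X, rowvar=False)           # divides by n-1 (default, same as R)
S_biased = np.cov(X, rowvar=False, bias=True)  # divides by n

# The biased estimate is the unbiased one scaled by (n-1)/n.
print(np.allclose(S_biased, S_unbiased * (n - 1) / n))  # True
```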