Pandas DataFrame.corr() doesn't give same results as Series.corr()

Question:

I have two time series:

jpm  = pd.read_csv(...) # JPM GBI Global All Traded
msci = pd.read_csv(...) # MSCI WORLD U$

Together in a DataFrame they look like

df = jpm.merge(msci, how='outer', on='Date', sort=True)
df
             Date  JPM GBI Global All Traded  MSCI WORLD U$
0      1970-01-01                        NaN        100.0
1      1970-01-02                        NaN        100.0
2      1970-01-05                        NaN        100.0
3      1970-01-06                        NaN        100.0
4      1970-01-07                        NaN        100.670
...
13838  2023-01-17                   492.3360       2736.452
13839  2023-01-18                   496.4402       2713.537
13840  2023-01-19                   494.9905       2685.317
13841  2023-01-20                   492.3206       2725.396
13842  2023-01-23                   491.5816       2754.961

I want to compute the correlation between the two timeseries.

Using DataFrame.corr():

corr1 = df.corr(method='pearson', min_periods=1, numeric_only=True)
corr1
                           JPM GBI Global All Traded  MSCI WORLD U$
JPM GBI Global All Traded                   1.000000       0.849705
MSCI WORLD U$                               0.849705       1.000000

Correlation is 0.849705

Using Series.corr():

s1 = jpm['JPM GBI Global All Traded']#.dropna()
s2 = msci['MSCI WORLD U$']#.dropna()
corr2 = s1.corr(s2, method='pearson', min_periods=1)
corr2

Correlation is 0.904641


The two correlations don't match, even though I expected them to be identical. Calling .dropna() on each Series first makes no difference.

The pandas.DataFrame.corr documentation says:

Compute pairwise correlation of columns, excluding NA/null values.

Is this a bug in my code or in pandas?

Asked By: Nermin

||

Answers:

I finally figured it out: the indexes of my two Series were not aligned.
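Applied to the setup in the question, the effect can be sketched as follows. The two small frames are made-up stand-ins for the CSV files (the column names 'JPM' and 'MSCI' and all values are hypothetical), and the sketch assumes each file was read with the default RangeIndex, so 'Date' is an ordinary column rather than the index.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; the second series starts later.
msci = pd.DataFrame({
    'Date': pd.date_range('2023-01-02', periods=6, freq='D'),
    'MSCI': [100.0, 101.0, 103.0, 102.0, 105.0, 107.0],
})
jpm = pd.DataFrame({
    'Date': pd.date_range('2023-01-04', periods=4, freq='D'),
    'JPM': [50.0, 49.0, 52.0, 53.0],
})

# Both frames carry a default RangeIndex, so Series.corr pairs row 0 with
# row 0, row 1 with row 1, and so on, regardless of the dates.
corr_by_row_number = jpm['JPM'].corr(msci['MSCI'])

# With 'Date' as the index, Series.corr pairs values that share a date.
corr_by_date = jpm.set_index('Date')['JPM'].corr(msci.set_index('Date')['MSCI'])

# The merged frame pairs values by date as well, so DataFrame.corr agrees.
merged = jpm.merge(msci, how='outer', on='Date', sort=True)
corr_from_frame = merged[['JPM', 'MSCI']].corr().loc['JPM', 'MSCI']

print(corr_by_row_number, corr_by_date, corr_from_frame)
```

In this sketch only the date-indexed Series.corr call reproduces the DataFrame.corr value; the row-number pairing compares values from different dates.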

Example using aligned indexes

df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]})

s1 = df['A']
s2 = df['B']
df
    A   B
0   1   1
1   2   1
2   3   2
3   4   4

DataFrame.corr():

corr1 = df.corr()
corr1
           A           B
A   1.000000    0.912871
B   0.912871    1.000000

Correlation is 0.912871

Series.corr():

corr2 = s1.corr(s2)
corr2

Correlation is 0.912871

Example using misaligned indexes

df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]}, index=[1,2,3,4])

s1 = df['A'].reset_index(drop=True)
s2 = df['B']
df
    A   B
1   1   1
2   2   1
3   3   2
4   4   4

DataFrame.corr():

corr1 = df.corr()
corr1
           A           B
A   1.000000    0.912871
B   0.912871    1.000000

Correlation is 0.912871

Series.corr():

corr2 = s1.corr(s2)
corr2

Correlation is 0.866025


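The second example gives a different result because pandas.Series.corr aligns the two Series on their index labels before computing the correlation, so only values with matching labels are paired. Here s1 has labels 0 to 3 and s2 has labels 1 to 4, so only labels 1, 2 and 3 are shared. The pairs actually used are (2, 1), (3, 1) and (4, 2), which gives 0.866025. DataFrame.corr is unaffected because its columns already share a single index.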
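To see which rows Series.corr pairs up in the misaligned example, the alignment can be reproduced by hand. This is a minimal sketch: Series.align with join='inner' is used here only to make the pairing visible, on the assumption that it mirrors the label matching described above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 1, 2, 4]}, index=[1, 2, 3, 4])
s1 = df['A'].reset_index(drop=True)  # labels 0, 1, 2, 3
s2 = df['B']                         # labels 1, 2, 3, 4

# Keep only the labels present in both Series (1, 2 and 3).
left, right = s1.align(s2, join='inner')
paired = pd.DataFrame({'A': left, 'B': right})
print(paired)

# Correlation over the label-matched pairs.
corr_by_label = s1.corr(s2)

# Dropping the labels on both sides pairs the values by position instead.
corr_by_position = s1.reset_index(drop=True).corr(s2.reset_index(drop=True))

print(corr_by_label, corr_by_position)
```

Resetting the index on both Series is only appropriate when the rows really do correspond by position. For dated data such as the series in the question, giving both Series the same date index is the safer choice.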

Answered By: Nermin