Pandas DataFrame.corr() doesn't give same results as Series.corr()

Question

I have two timeseries

jpm  = pd.read_csv(...) # JPM GBI Global All Traded
msci = pd.read_csv(...) # MSCI WORLD U$

Together in a DataFrame they look like

df = jpm.merge(msci, how='outer', on='Date', sort=True)
df

	Date	JPM GBI Global All Traded	MSCI WORLD U$
0	1970-01-01	NaN	100.0
1	1970-01-02	NaN	100.0
2	1970-01-05	NaN	100.0
3	1970-01-06	NaN	100.0
4	1970-01-07	NaN	100.670
…	…	…	…
13838	2023-01-17	492.3360	2736.452
13839	2023-01-18	496.4402	2713.537
13840	2023-01-19	494.9905	2685.317
13841	2023-01-20	492.3206	2725.396
13842	2023-01-23	491.5816	2754.961

I want to compute the correlation between the two timeseries.

Using DataFrame.corr():

corr1 = df.corr(method='pearson', min_periods=1, numeric_only=True)
corr1

	JPM GBI Global All Traded	MSCI WORLD U$
JPM GBI Global All Traded	1.000000	0.849705
MSCI WORLD U$	0.849705	1.000000

Correlation is 0.849705

Using Series.corr():

s1 = jpm['JPM GBI Global All Traded']#.dropna()
s2 = msci['MSCI WORLD U$']#.dropna()
corr2 = s1.corr(s2, method='pearson', min_periods=1)
corr2

Correlation is 0.904641

As you can see, the two correlations don't match, even though they should. I've also tried calling .dropna() on each Series manually, but it makes no difference.

And according to the pandas.DataFrame.corr documentation:

Compute pairwise correlation of columns, excluding NA/null values.

Is it a bug with my code or with pandas?

Asked By: Nermin


Answer 1

I finally figured it out: the indexes of my two Series were not aligned.

Example using aligned indexes

import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]})

s1 = df['A']
s2 = df['B']
df

DataFrame.corr():

corr1 = df.corr()
corr1

           A           B
A   1.000000    0.912871
B   0.912871    1.000000

Correlation is 0.912871

Series.corr():

corr2 = s1.corr(s2)
corr2

Correlation is 0.912871

Example using misaligned indexes

df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]}, index=[1,2,3,4])

s1 = df['A'].reset_index(drop=True)
s2 = df['B']
df

DataFrame.corr():

corr1 = df.corr()
corr1

           A           B
A   1.000000    0.912871
B   0.912871    1.000000

Correlation is 0.912871

Series.corr():

corr2 = s1.corr(s2)
corr2

Correlation is 0.866025

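We get a different result in the second example because pandas.Series.corr aligns the two Series on their index first and computes the correlation only over labels present in both. Here s1 is indexed 0 to 3 and s2 is indexed 1 to 4, so only labels 1, 2 and 3 are used, pairing the values (2, 3, 4) from s1 with (1, 1, 2) from s2. DataFrame.corr() is unaffected because the columns of a DataFrame always share one index.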
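Applied to the situation in the question, one possible fix is to index both Series by date before calling Series.corr, so that rows are paired by date and not by row position. The sketch below uses small made-up frames in place of the two CSV files; the shortened column names 'JPM' and 'MSCI' and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (values are made up).
jpm = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-03', '2023-01-04',
                            '2023-01-05', '2023-01-06']),
    'JPM': [1.0, 2.0, 3.0, 4.0],
})
msci = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-02', '2023-01-03', '2023-01-04',
                            '2023-01-05', '2023-01-06']),
    'MSCI': [9.0, 1.0, 1.0, 2.0, 4.0],
})

# With the default RangeIndex, Series.corr pairs rows by position,
# so different dates end up being compared with each other.
by_position = jpm['JPM'].corr(msci['MSCI'])

# Indexing both Series by Date makes Series.corr pair rows by date.
s1 = jpm.set_index('Date')['JPM']
s2 = msci.set_index('Date')['MSCI']
by_date = s1.corr(s2)

# The merged DataFrame is already aligned on Date, so DataFrame.corr
# gives the same number as the date-indexed Series.
df = jpm.merge(msci, how='outer', on='Date', sort=True)
from_frame = df[['JPM', 'MSCI']].corr().loc['JPM', 'MSCI']

print(by_position, by_date, from_frame)
```

Here by_date and from_frame agree, while by_position differs because it compares values from different dates. Alternatively, taking both columns from the merged DataFrame (df['JPM'].corr(df['MSCI'])) should avoid the problem as well, since they share one index.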

Answered By: Nermin
