Pandas DataFrame.corr() doesn't give the same results as Series.corr()
Question:
I have two timeseries
jpm = pd.read_csv(...) # JPM GBI Global All Traded
msci = pd.read_csv(...) # MSCI WORLD U$
Together in a DataFrame they look like
df = jpm.merge(msci, how='outer', on='Date', sort=True)
df
| | Date | JPM GBI Global All Traded | MSCI WORLD U$ |
|---|---|---|---|
| 0 | 1970-01-01 | NaN | 100.0 |
| 1 | 1970-01-02 | NaN | 100.0 |
| 2 | 1970-01-05 | NaN | 100.0 |
| 3 | 1970-01-06 | NaN | 100.0 |
| 4 | 1970-01-07 | NaN | 100.670 |
| … | … | … | … |
| 13838 | 2023-01-17 | 492.3360 | 2736.452 |
| 13839 | 2023-01-18 | 496.4402 | 2713.537 |
| 13840 | 2023-01-19 | 494.9905 | 2685.317 |
| 13841 | 2023-01-20 | 492.3206 | 2725.396 |
| 13842 | 2023-01-23 | 491.5816 | 2754.961 |
I want to compute the correlation between the two timeseries.
Using DataFrame.corr():
corr1 = df.corr(method='pearson', min_periods=1, numeric_only=True)
corr1
| | JPM GBI Global All Traded | MSCI WORLD U$ |
|---|---|---|
| JPM GBI Global All Traded | 1.000000 | 0.849705 |
| MSCI WORLD U$ | 0.849705 | 1.000000 |
Correlation is 0.849705
Using Series.corr():
s1 = jpm['JPM GBI Global All Traded']#.dropna()
s2 = msci['MSCI WORLD U$']#.dropna()
corr2 = s1.corr(s2, method='pearson', min_periods=1)
corr2
Correlation is 0.904641
As you can see, the two correlations don’t match, even though they should. I’ve also tried applying .dropna() manually, but it makes no difference.
And according to the pandas.DataFrame.corr documentation:
Compute pairwise correlation of columns, excluding NA/null values.
Is it a bug with my code or with pandas?
Answers:
I finally figured it out and it had to do with my indexes not being aligned.
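For the two series in the question, a minimal sketch of the fix could look like the following. It assumes both CSVs carry a 'Date' column (as the merge on 'Date' suggests) and that each frame has the default 0..n-1 index after read_csv. The small frames below are made-up stand-ins for the real files, so the numbers mean nothing in themselves.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files; values are illustrative only.
jpm = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-18", "2023-01-19", "2023-01-20", "2023-01-23"]),
    "JPM GBI Global All Traded": [1.0, 2.0, 4.0, 3.0],
})
msci = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-16", "2023-01-17", "2023-01-18",
                            "2023-01-19", "2023-01-20", "2023-01-23"]),
    "MSCI WORLD U$": [5.0, 9.0, 2.0, 3.0, 7.0, 4.0],
})

# DataFrame route: the merge pairs rows by Date before correlating.
df = jpm.merge(msci, how="outer", on="Date", sort=True)
frame_corr = df.set_index("Date").corr().loc[
    "JPM GBI Global All Traded", "MSCI WORLD U$"]

# Series route with the default index: row i of jpm is paired with
# row i of msci, which are different dates here.
misaligned_corr = jpm["JPM GBI Global All Traded"].corr(msci["MSCI WORLD U$"])

# Series route indexed by Date: values from the same day are paired.
s1 = jpm.set_index("Date")["JPM GBI Global All Traded"]
s2 = msci.set_index("Date")["MSCI WORLD U$"]
aligned_corr = s1.corr(s2)

print(frame_corr, misaligned_corr, aligned_corr)
```

With both Series indexed by Date, Series.corr and DataFrame.corr work from the same pairs of observations, so the two results agree.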
Example using aligned indexes
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]})
s1 = df['A']
s2 = df['B']
df
A B
0 1 1
1 2 1
2 3 2
3 4 4
DataFrame.corr():
corr1 = df.corr()
corr1
A B
A 1.000000 0.912871
B 0.912871 1.000000
Correlation is 0.912871
Series.corr():
corr2 = s1.corr(s2)
corr2
Correlation is 0.912871
Example using misaligned indexes
df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,1,2,4]}, index=[1,2,3,4])
s1 = df['A'].reset_index(drop=True)  # index is now 0, 1, 2, 3
s2 = df['B']  # index is still 1, 2, 3, 4
df
A B
1 1 1
2 2 1
3 3 2
4 4 4
DataFrame.corr():
corr1 = df.corr()
corr1
A B
A 1.000000 0.912871
B 0.912871 1.000000
Correlation is 0.912871
Series.corr():
corr2 = s1.corr(s2)
corr2
Correlation is 0.866025
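The second example gives a different result because pandas.Series.corr first aligns the two Series on their index labels and then computes the correlation only over labels present in both. Here s1 is indexed 0 to 3 and s2 is indexed 1 to 4, so only labels 1, 2 and 3 are used, which pairs the values (2, 3, 4) from A with (1, 1, 2) from B. DataFrame.corr is unaffected because the columns of a DataFrame always share one index.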
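The alignment can be checked directly on the misaligned example. This is a small sketch, not a look at pandas internals: it uses Series.align with join='inner' to keep only the shared labels, then compares against a correlation by position. The names overlap_corr and positional_corr are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 1, 2, 4]}, index=[1, 2, 3, 4])
s1 = df['A'].reset_index(drop=True)  # labels 0, 1, 2, 3
s2 = df['B']                         # labels 1, 2, 3, 4

# Keep only the labels that exist in both Series.
a, b = s1.align(s2, join='inner')
print(pd.DataFrame({'A': a, 'B': b}))

# Correlating the overlapping rows gives the same value as s1.corr(s2).
overlap_corr = a.corr(b)

# To correlate by position instead, give both Series the same labels first.
positional_corr = s1.corr(s2.reset_index(drop=True))

print(overlap_corr, positional_corr)
```

Resetting the index on both sides (or converting to NumPy arrays) compares values by position, which is only appropriate when the rows are already known to be in matching order.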