How to get p-values and Pearson's r for a list of columns in pandas?

Question:

I’m trying to make a multiindexed table (a matrix) of correlation coefficients and p-values. I’d prefer to use the scipy.stats tests.

import pandas as pd
from scipy.stats import pearsonr

x = pd.DataFrame(
    list(zip([1, 2, 3, 4, 5, 6], [5, 7, 8, 4, 2, 8], [13, 16, 12, 11, 9, 10])),
    columns=['a', 'b', 'c'],
)

# I've tried something like this
for i in range(len(x.columns)):
    r,p = pearsonr(x[x.columns[i]], x[x.columns[i+1]])
    print(f'{r}, {p}')

Obviously the for loop won’t work. What I want to end up with is:

        a     b     c
a  r  1.0  -.09   -.8
   p  .00   .87   .06
b  r -.09     1   .42
   p  .87   .00   .41
c  r  -.8   .42     1
   p  .06   .41   .00

I had written code to solve this problem (with help from this community) years ago, but it only worked for an older version of spearmanr.

Any help would be very much appreciated.

Asked By: KevOMalley743


Answers:

Here is one way to do it, using SciPy's pearsonr together with the pandas DataFrame.corr method (x is the DataFrame from the question):

import pandas as pd
from scipy.stats import pearsonr

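def pearsonr_pval(x, y):
    # pearsonr returns (r, p-value); keep only the p-value
    return pearsonr(x, y)[1]


# Note: DataFrame.corr always sets the diagonal to 1, even with a callable,
# so the diagonal p-values come out as 1.0 rather than 0.
df = (
    pd.concat(
        [
            x.corr(method="pearson").reset_index().assign(value="r"),
            x.corr(method=pearsonr_pval).reset_index().assign(value="p"),
        ]
    )
    .groupby(["index", "value"])
    .agg(lambda col: list(col)[0])  # each (column, statistic) group holds one row
).sort_index(ascending=[True, False])  # columns A-Z, with "r" before "p"

# Blank out the index level names
df.index.names = ["", ""]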

Then:

print(df)
# Output
            a         b         c

a r  1.000000 -0.088273 -0.796421
  p  1.000000  0.867934  0.057948
b r -0.088273  1.000000  0.421184
  p  0.867934  1.000000  0.405583
c r -0.796421  0.421184  1.000000
  p  0.057948  0.405583  1.000000
Answered By: Laurent
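
The diagonal p-values in the output above are 1.0 because DataFrame.corr fills the diagonal with 1 regardless of the callable it is given. If the diagonal matters, one possible alternative is to call pearsonr on every pair of columns directly. This is a minimal sketch, not part of the answer above; the helper name corr_with_pvalues is hypothetical, and the sample data is copied from the question.

```python
import pandas as pd
from scipy.stats import pearsonr

# Sample data from the question
x = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [5, 7, 8, 4, 2, 8],
        "c": [13, 16, 12, 11, 9, 10],
    }
)


def corr_with_pvalues(frame):
    """Hypothetical helper: return r and p for every pair of columns,
    indexed by (column, statistic)."""
    cols = list(frame.columns)
    keys, rows = [], []
    for row in cols:
        # One pearsonr call per pair, including the column with itself
        results = [pearsonr(frame[row], frame[col]) for col in cols]
        keys.append((row, "r"))
        rows.append([res[0] for res in results])
        keys.append((row, "p"))
        rows.append([res[1] for res in results])
    return pd.DataFrame(rows, index=pd.MultiIndex.from_tuples(keys), columns=cols)


result = corr_with_pvalues(x)
print(result.round(2))
```

Because every cell comes from pearsonr itself, the diagonal p-values should be (approximately) zero rather than 1. The trade-off is that each pair is computed twice, which is fine for a handful of columns but wasteful for wide frames.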