Use .corr to get the correlation between two columns

Question:

I have the following pandas dataframe Top15:
(image of the Top15 dataframe; not reproduced here)

I create a column that estimates the number of citable documents per person:

Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']

I want to know the correlation between the number of citable documents per capita and the energy supply per capita. So I use the .corr() method (Pearson’s correlation):

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

I want to return a single number, but the result is:
(image of the resulting output; not reproduced here)

Asked By: tong zhu


Answers:

Without actual data it is hard to answer the question but I guess you are looking for something like this:

Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6

Then

df['A'].corr(df['B'])

gives 1 as expected.

Now, if you change a value, e.g.

df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0

the command

df['A'].corr(df['B'])

returns

0.99586

which is still close to 1, as expected.

If you apply .corr() directly to your dataframe, it will return all pairwise correlations between your columns; that’s why you then observe 1s at the diagonal of your matrix (each column is perfectly correlated with itself).

df.corr()

will therefore return

          A         B
A  1.000000  0.995862
B  0.995862  1.000000

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get NaNs in your result; check this post for an example.
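
As a minimal sketch of one way a NaN can arise (an assumption for illustration, not necessarily the case covered in the linked post): Pearson's correlation divides by the standard deviation of each column, so a column with zero variance has no defined correlation. The frame `toy` and its column `C` are hypothetical.

```python
import math
import pandas as pd

# Hypothetical toy data: column 'C' is constant, so its variance is zero.
toy = pd.DataFrame({'A': [0, 1, 2, 3], 'C': [5, 5, 5, 5]})

# The correlation with a constant column is undefined and comes back as NaN
# (NumPy may also emit a RuntimeWarning about an invalid division).
r_const = toy['A'].corr(toy['C'])
print(math.isnan(r_const))
```

If NaNs show up in a real correlation matrix, checking for constant columns with something like `toy.nunique()` is a reasonable first step.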

If you want to filter entries above/below a certain threshold, you can check this question.
If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.
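
For the threshold filtering mentioned above, one possible approach (a sketch, not necessarily what the linked question uses) is to stack the correlation matrix into a Series of column pairs and keep only the pairs whose absolute value exceeds a cutoff. The frame `toy`, the column `C`, and the cutoff `0.9` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; 'A' and 'B' are the columns from the example above.
toy = pd.DataFrame({
    'A': [0, 1, 2, 3],
    'B': [0.0, 2.0, 4.5, 6.0],
    'C': [1, -1, 1, -1],
})

corr = toy.corr()
threshold = 0.9  # hypothetical cutoff

# One entry per (row label, column label) pair of the matrix
pairs = corr.stack()

# Drop the diagonal (each column paired with itself)
is_distinct = pairs.index.get_level_values(0) != pairs.index.get_level_values(1)

# Keep pairs of distinct columns with a strong positive or negative correlation
strong = pairs[(pairs.abs() > threshold) & is_distinct]
print(strong)
```

Because the matrix is symmetric, each surviving pair appears twice, once as (A, B) and once as (B, A).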

Answered By: Cleb

I ran into the same issue.
It appeared that Citable Documents per Person was a float, and Python somehow skips it by default. All the other columns of my dataframe were in NumPy formats, so I solved it by converting the column to np.float64:

Top15['Citable Documents per Person']=np.float64(Top15['Citable Documents per Person'])

Remember that this is exactly the column you calculated yourself.

Answered By: Gary

If you want the correlations between all pairs of columns, you could do something like this:

import pandas as pd
import numpy as np

def get_corrs(df):
    col_correlations = df.corr()
    col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
    cor_pairs = col_correlations.stack()
    return cor_pairs.to_dict()

my_corrs = get_corrs(df)
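# Retrieve a single correlation. Only the lower triangle is kept, so the key is
# (column that comes later in df, column that comes earlier); the swapped key maps to 0.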
print(my_corrs[('Citable docs per Capita','Energy Supply per Capita')])
Answered By: mgoldwasser

I solved this problem by changing the data type. ‘Energy Supply per Capita’ is a numerical type, while ‘Citable docs per Capita’ is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked, while mean and std didn’t.
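
A minimal sketch of that check and conversion, assuming a made-up frame `toy` whose `docs` column holds numbers stored as strings (the names are hypothetical, and `pd.to_numeric` is used here as one alternative to `astype`):

```python
import pandas as pd

# Hypothetical toy frame: 'docs' is deliberately created with object dtype.
toy = pd.DataFrame({
    'docs': pd.Series(['0', '2', '4.5', '6'], dtype=object),
    'energy': [0.0, 1.0, 2.0, 3.0],
})

# Inspect the dtypes first: a non-numeric dtype is the warning sign.
print(toy.dtypes)

# Convert to a numeric dtype; errors='coerce' turns entries that cannot be
# parsed into NaN instead of raising an exception.
toy['docs'] = pd.to_numeric(toy['docs'], errors='coerce')

# With both columns numeric, the correlation is a single float.
r = toy['docs'].corr(toy['energy'])
print(r)
```

How `DataFrame.corr()` treats a non-numeric column depends on the pandas version, so converting explicitly before computing the correlation avoids relying on that behaviour.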

Answered By: BID

When you call this:

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

Since the DataFrame.corr() function performs pairwise correlations, you get four values from two variables. The diagonal values are the correlation of each variable with itself (two values, since you have two variables), and the other two values are the cross-correlations of one variable with the other and vice versa.
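
The structure of that 2x2 result can be sketched on a small made-up frame (`toy`, with hypothetical columns `A` and `B`): the diagonal entries are 1, and the two off-diagonal entries are the same number, which is the single value wanted.

```python
import pandas as pd

# Hypothetical toy data standing in for the two Top15 columns.
toy = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [0.0, 2.0, 4.5, 6.0]})

m = toy.corr(method='pearson')

# Diagonal: each column correlated with itself
diag = [m.loc['A', 'A'], m.loc['B', 'B']]

# Off-diagonal: the same cross-correlation, stored twice
off = [m.loc['A', 'B'], m.loc['B', 'A']]

# The Series-level call returns that same number directly
single = toy['A'].corr(toy['B'])
print(diag, off, single)
```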

Either perform correlation between two series to get a single value:

from scipy.stats import pearsonr
docs_col = Top15['Citable docs per Capita'].values
energy_col = Top15['Energy Supply per Capita'].values
corr , _ = pearsonr(docs_col, energy_col)

or,
if you want a single value from the same function (DataFrame’s corr):

single_value = correlation.iloc[0, 1]

Hope this helps.

Answered By: aumpen

My solution, after converting the data to a numerical type, would be:

Top15[['Citable docs per Capita','Energy Supply per Capita']].corr()
Answered By: ibozkurt79

It works like this:

Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])

Top15['Energy Supply per Capita']=np.float64(Top15['Energy Supply per Capita'])

Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])
Answered By: Orca

Changing ‘Citable docs per Capita’ to numeric before computing the correlation will solve the problem.

Top15['Citable docs per Capita'] = pd.to_numeric(Top15['Citable docs per Capita'])
data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')
Answered By: moinul hossain

The following works for me: take the correlation matrix, then select by variable names:

cor_df = df.corr()  # take the correlation from the data
cor_df.loc['Citable docs per Capita','Energy Supply per Capita'] # only single value

If you put the variable names in [], it returns the variable names as well:

cor_df.loc[['Citable docs per Capita'],['Energy Supply per Capita']]
Answered By: Sam S.

If you want a quick way of finding and plotting the correlation between all the columns with respect to a single column (‘Energy Supply per Capita’, in this case):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cor = []
plt.figure()
for col in data.columns:
    try:
        # correlation of this column with the reference column, computed once
        r = np.corrcoef(data[col], data['Energy Supply per Capita'])[0, 1]
    except Exception:
        continue  # skip columns for which no correlation can be computed
    cor.append(r)
    plt.scatter(col, r)
    plt.axvline(x=col, color='k', linestyle='-', alpha=0.2)
plt.ylabel('correlation')
plt.xticks(rotation=90)
plt.show()