Correlation among multiple categorical variables (Pandas)
Question:
I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function
DataFrame.corr(method='pearson', min_periods=1)
only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate it myself to perform a chi-square test or something like it, and I am not quite sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):
      cat1  cat2  cat3
cat1| coef  coef  coef
cat2| coef  coef  coef
cat3| coef  coef  coef
Any ideas with pd.pivot_table or something in the same vein?
thanks in advance
D.
Answers:
You can use pd.factorize to encode each column as integer codes and then call corr on the result:
df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]:
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0
Data input
df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
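To make the factorize step more concrete, here is a minimal sketch of what pd.factorize returns for a single column. The series `s` and its values are made up for illustration. The codes are assigned in order of first appearance, so any numeric correlation computed on them reflects that arbitrary ordering.

```python
import pandas as pd

# Hypothetical example column, not taken from the question's data
s = pd.Series(["red", "green", "red", "blue"])

# factorize returns (integer codes, unique values in order of first appearance)
codes, uniques = pd.factorize(s)

print(codes)          # one integer code per row
print(list(uniques))  # the label each code stands for
```

Applying this column by column, as in `df.apply(lambda x: pd.factorize(x)[0])`, yields an all-integer DataFrame that `corr` will accept.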
Update
from scipy.stats import chisquare
df=df.apply(lambda x : pd.factorize(x)[0])+1
pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])
Out[123]:
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})
It turns out that the only solution I found is to iterate through all the factor*factor pairs.
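import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# every ordered pair of columns, including each column paired with itself
factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]
chi2, p_values = [], []
for f in factors_paired:
    if f[0] != f[1]:
        # chi-square test on the contingency table of the two factors
        chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
        chi2.append(chitest[0])
        p_values.append(chitest[1])
    else:  # for same factor pair
        chi2.append(0)
        p_values.append(0)
n = len(df.columns)  # use the actual number of factors rather than a hard-coded size
chi2 = np.array(chi2).reshape((n, n))  # shape it as a square matrix
chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values)  # then a df for convenience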
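The raw chi-square statistic grows with the sample size and the number of categories, so it is not directly comparable across pairs in a heatmap. One common normalization is Cramér's V, V = sqrt(chi2 / (n * (min(r, c) - 1))), which lies between 0 and 1. The following is a minimal sketch under that assumption: the helper name `cramers_v_matrix` and the toy data are hypothetical, and no bias correction is applied.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_matrix(df):
    """Return a symmetric DataFrame of pairwise Cramér's V values."""
    cols = list(df.columns)
    # Diagonal is set to 1: a variable is perfectly associated with itself
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for a, b in combinations(cols, 2):  # each unordered pair once
        table = pd.crosstab(df[a], df[b])
        # correction=False turns off Yates' continuity correction for 2x2 tables
        chi2 = chi2_contingency(table, correction=False)[0]
        n = table.to_numpy().sum()
        k = min(table.shape) - 1
        # V is undefined when one variable has a single category
        v = np.sqrt(chi2 / (n * k)) if k > 0 else np.nan
        out.loc[a, b] = out.loc[b, a] = v
    return out


# Hypothetical toy data: y is fully determined by x, z is independent of x
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b", "c", "c"],
    "y": ["u", "u", "v", "v", "w", "w"],
    "z": ["p", "q", "p", "q", "p", "q"],
})
print(cramers_v_matrix(toy))
```

Iterating over `combinations` visits each pair once and mirrors the value, which halves the number of chi-square tests compared with looping over all ordered pairs.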
Using the association-metrics Python package, calculating the Cramér's V coefficient matrix from a pandas.DataFrame object is quite simple:
First install association_metrics using:
pip install association-metrics
Then you can use the following code:
# Import association_metrics
import association_metrics as am
# Convert your str columns to Category columns
df = df.apply(
    lambda x: x.astype("category") if x.dtype == "O" else x)
# Initialize a CramersV object using your pandas.DataFrame
cramersv = am.CramersV(df)
# fit() returns a pairwise matrix filled with Cramér's V, where the columns and
# index are the categorical variables of the passed pandas.DataFrame
cramersv.fit()