Correlation between two non-numeric columns in a Pandas DataFrame

Question:

I load my data from an SQL table into a pandas DataFrame with a query. The data looks like this:

        group  phone_brand
0      M32-38          小米
1      M32-38          小米
2      M32-38          小米
3      M29-31          小米
4      M29-31          小米
5      F24-26         OPPO
6      M32-38          酷派
7      M32-38          小米
8      M32-38         vivo
9      F33-42          三星
10     M29-31          华为
11     F33-42          华为
12     F27-28          三星
13     M32-38          华为
14       M39+         艾优尼
15     F27-28          华为
16     M32-38          小米
17     M32-38          小米
18       M39+          魅族
19     M32-38          小米
20     F33-42          三星
21     M23-26          小米
22     M23-26          华为
23     M27-28          三星
24     M29-31          小米
25     M32-38          三星
26     M32-38          三星
27     F33-42          三星
28     M32-38          三星
29     M32-38          三星
...       ...          ...
74809  M27-28          华为
74810  M29-31          TCL

Now I want to find the correlation between these two columns, along with their frequencies, and visualize the result with Matplotlib. I tried something like:

DataFrame.plot(style='o')
plt.show() 

Now how can I visualize this correlation in the simplest way?

Asked By: madik_atma

||

Answers:

To quickly get a correlation, map each distinct value to an integer code with factorize and correlate the codes:

df.apply(lambda x: x.factorize()[0]).corr()

                group  phone_brand
group        1.000000     0.427941
phone_brand  0.427941     1.000000

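Heat map of the frequencies:

import pandas as pd
import seaborn as sns

# Cross-tabulate how often each (group, phone_brand) pair occurs and draw it as a heat map
sns.heatmap(pd.crosstab(df.group, df.phone_brand))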
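Since the question asks for Matplotlib specifically, the same frequency table can also be drawn without seaborn. The following is only a minimal sketch: the sample `df` is an assumption built from the first 11 rows shown in the question, and the helper name `plot_frequency_heatmap` is hypothetical. Depending on the installed fonts, the Chinese brand names may need a CJK-capable font to render in the tick labels.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed sample: the first 11 rows from the question
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# Frequency of every (group, phone_brand) combination
counts = pd.crosstab(df["group"], df["phone_brand"])


def plot_frequency_heatmap(table):
    """Draw a crosstab as a heat map using plain Matplotlib (hypothetical helper)."""
    fig, ax = plt.subplots()
    image = ax.imshow(table.to_numpy(), aspect="auto")
    # Label the ticks with the category names from the crosstab
    ax.set_xticks(range(table.shape[1]))
    ax.set_xticklabels(table.columns)
    ax.set_yticks(range(table.shape[0]))
    ax.set_yticklabels(table.index)
    ax.set_xlabel(table.columns.name)
    ax.set_ylabel(table.index.name)
    fig.colorbar(image, ax=ax, label="count")
    return ax


ax = plot_frequency_heatmap(counts)
```

Calling `plt.show()` afterwards displays the figure. The `counts` table is also useful on its own, because it is the frequency part of what the question asks for.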


Answered By: piRSquared

Use the pandas.factorize() method, which returns a numeric representation of an array by assigning an integer code to each distinct value.
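As a rough illustration of what factorize does, here is a small sketch. The sample `df` is an assumption built from the first 11 rows shown in the question. Codes are assigned in order of first appearance unless `sort=True` is passed, and because the correlation is computed on those codes, its value depends on which ordering is used.

```python
import pandas as pd

# Assumed sample: the first 11 rows from the question
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# factorize returns the integer codes and the distinct values they stand for
codes, uniques = pd.factorize(df["group"])
print(codes)    # [0 0 0 1 1 2 0 0 0 3 1]
print(uniques)  # distinct groups in order of first appearance

# Codes in order of first appearance
by_appearance = df.apply(lambda col: col.factorize()[0])
# Codes in sorted order of the distinct values
by_sorted = df.apply(lambda col: col.factorize(sort=True)[0])

r_appearance = by_appearance.corr().loc["group", "phone_brand"]
r_sorted = by_sorted.corr().loc["group", "phone_brand"]
print(r_appearance, r_sorted)  # the two encodings give different values
```

Because the integer codes are arbitrary labels rather than measurements, the resulting number is best read as a quick indicator, not as a stable measure of association.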

Answered By: A. Rehman

Apart from the method piRSquared explained very clearly, you can use scikit-learn's LabelEncoder, which transforms the values into integer codes that numeric methods such as corr() can work with.

# Import the label encoder
from sklearn.preprocessing import LabelEncoder

# Create the label encoder object
le = LabelEncoder()

# Fit the encoder on each column and replace the values with their integer codes
df['group'] = le.fit_transform(df['group'])
df['phone_brand'] = le.fit_transform(df['phone_brand'])

# Compute the correlation
df.corr()

Output for the first 11 rows (index 0 to 10):

               group  phone_brand
group        1.00000      0.67391
phone_brand  0.67391      1.00000

After applying LabelEncoder, the DataFrame is converted from this:

     group  phone_brand
0   M32-38          小米
1   M32-38          小米
2   M32-38          小米
3   M29-31          小米
4   M29-31          小米
5   F24-26         OPPO
6   M32-38          酷派
7   M32-38          小米
8   M32-38         vivo
9   F33-42          三星
10  M29-31          华为

to this:

   group    phone_brand
0      3              4
1      3              4
2      3              4
3      2              4
4      2              4
5      0              0
6      3              5
7      3              4
8      3              1
9      1              2
10     2              3

For multiple columns, you can go through the answers.
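One possible way to encode several columns at once is sketched below. It is not taken from the answers referred to above. The sample `df` is an assumption built from the first 11 rows of the question, and the `encoders` dictionary is a hypothetical convenience for mapping codes back to labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed sample: the first 11 rows from the question
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

encoded = df.copy()
encoders = {}  # one fitted encoder per column, kept for decoding

for column in ["group", "phone_brand"]:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(encoded)

# Map a code back to its original label
original_group = encoders["group"].inverse_transform([3])[0]
print(original_group)
```

Using a separate encoder per column leaves the original `df` untouched and makes it possible to recover the labels, which is lost when a single encoder is refitted on each column in turn.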

Answered By: Hari Sharma