Pandas: convert categories to numbers
Question:
Suppose I have a dataframe with countries that goes as:
cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0
I know that there is a pd.get_dummies function to convert the countries to ‘one-hot encodings’. However, I wish to convert them to indices instead, so that I get cc_index = [1,2,1,3].
I’m assuming that there is a faster way than using get_dummies along with a numpy where clause, as shown below:
[np.where(x) for x in pd.get_dummies(df.cc).values]
This is somewhat easier to do in R using ‘factors’ so I’m hoping pandas has something similar.
Answers:
First, change the type of the column:
df.cc = pd.Categorical(df.cc)
Now the data look similar but are stored categorically. To capture the category codes:
df['code'] = df.cc.cat.codes
Now you have:
   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
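To read the codes back as labels, the categories can be enumerated. This is a minimal sketch on the sample data from the question; the names `code_to_label` and `codes` are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

cc = df["cc"].astype("category")

# Categories are stored sorted, so AU -> 0, CA -> 1, US -> 2.
code_to_label = dict(enumerate(cc.cat.categories))
codes = cc.cat.codes.tolist()

print(code_to_label)  # {0: 'AU', 1: 'CA', 2: 'US'}
print(codes)          # [2, 1, 2, 0]
```

The code of a value is its position in `cat.categories`, which is why US ends up with 2 rather than 0.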
If you don’t want to modify your DataFrame but simply get the codes:
df.cc.astype('category').cat.codes
Or use the categorical column as an index:
df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
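With the categorical index in place, the integer codes are available on the index itself. A small sketch, assuming the sample data from the question (`index_codes` and `us_temps` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)

# The index carries the same sorted codes as the categorical column.
index_codes = df2.index.codes.tolist()

# Label-based lookup still works; US appears twice, so two rows come back.
us_temps = df2.loc["US", "temp"].tolist()

print(index_codes)  # [2, 1, 2, 0]
print(us_temps)     # [37.0, 35.0]
```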
If you wish only to transform your series into integer identifiers, you can use pd.factorize.
Note this solution, unlike pd.Categorical, will not sort alphabetically. So the first country will be assigned 0. If you wish to start from 1, you can add a constant:
df['code'] = pd.factorize(df['cc'])[0] + 1
print(df)
   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3
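pd.factorize returns two things: the codes and the unique values they refer to. The sketch below, again on the question's sample data, unpacks both and looks the labels back up from the codes (`restored` is an illustrative name).

```python
import pandas as pd

cc = pd.Series(["US", "CA", "US", "AU"])

# factorize returns (codes, uniques); each code is a position in uniques.
codes, uniques = pd.factorize(cc)

print(codes.tolist())        # [0, 1, 0, 2]
print(list(uniques))         # ['US', 'CA', 'AU']
print((codes + 1).tolist())  # [1, 2, 1, 3]

# Round trip: recover the original labels from the codes.
restored = list(uniques.take(codes))
```

Adding 1 gives exactly the cc_index = [1,2,1,3] asked for in the question, because codes are handed out in order of first appearance.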
If you wish to sort alphabetically, specify sort=True:
df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1
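As a quick check on the sample data, the sorted factorize codes (before adding 1) should match the codes from the categorical approach. A sketch with illustrative variable names:

```python
import pandas as pd

cc = pd.Series(["US", "CA", "US", "AU"])

sorted_codes, sorted_uniques = pd.factorize(cc, sort=True)
categorical_codes = cc.astype("category").cat.codes

print(sorted_codes.tolist())        # [2, 1, 2, 0]
print(list(sorted_uniques))         # ['AU', 'CA', 'US']
print((sorted_codes + 1).tolist())  # [3, 2, 3, 1]
```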
If you are using the sklearn library you can use LabelEncoder. Like pd.Categorical, input strings are sorted alphabetically before encoding.
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])
print(df)
   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
This function will change any of the given columns into numbers. It will not create a new column but just replace the values with numerical data.
def characters_to_numb(*args):
    # Relies on a DataFrame named df already being in scope.
    for arg in args:
        df[arg] = pd.Categorical(df[arg])
        df[arg] = df[arg].cat.codes
    return df
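If you would rather not depend on a global df or overwrite the original, the same idea can be written with the frame passed in and a copy returned. This is only a sketch; `columns_to_codes` and its parameters are hypothetical names, not part of pandas.

```python
import pandas as pd

def columns_to_codes(frame, *columns):
    """Return a copy of frame with each named column replaced by its category codes."""
    out = frame.copy()
    for name in columns:
        out[name] = pd.Categorical(out[name]).codes
    return out

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})
encoded = columns_to_codes(df, "cc")

print(encoded["cc"].tolist())  # [2, 1, 2, 0]
print(df["cc"].tolist())       # ['US', 'CA', 'US', 'AU']
```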
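Try this to convert to numbers based on frequency (the more frequent the value, the higher the number):
labels = df[col].value_counts(ascending=True).index.tolist()
codes = range(1, len(labels) + 1)
# Assign the result back rather than calling replace(..., inplace=True) on a column selection
df[col] = df[col].map(dict(zip(labels, codes)))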
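A worked example of the frequency-based encoding may help. It uses a slightly longer, made-up series so that every country has a different count; when two values have the same count, their relative order should not be relied on.

```python
import pandas as pd

# Hypothetical data: US appears 3 times, CA twice, AU once.
s = pd.Series(["US", "CA", "US", "AU", "US", "CA"])

# Least frequent label first, so rank 1 goes to the rarest value.
labels = s.value_counts(ascending=True).index.tolist()
mapping = {label: rank for rank, label in enumerate(labels, start=1)}
freq_codes = s.map(mapping)

print(labels)               # ['AU', 'CA', 'US']
print(freq_codes.tolist())  # [3, 2, 3, 1, 3, 2]
```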
One-line code:
df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes)
This also works if you have a list_of_columns:
df[list_of_columns] = df[list_of_columns].apply(lambda col:pd.Categorical(col).codes)
Missing values are encoded as -1, so if you want to keep your NaN values you can apply a replace:
df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
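The effect on a column with a missing value can be sketched as follows. The data is made up for illustration, and the codes are cast to float first as an assumption that you want NaN in the result (integer columns cannot hold it).

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing country.
df = pd.DataFrame({"cc": ["US", None, "US", "AU"]})

# Missing values get the code -1.
raw = pd.Categorical(df["cc"]).codes

# Cast to float so NaN can be stored, then turn -1 back into NaN.
codes = pd.Series(raw, index=df.index, dtype="float64").replace(-1, np.nan)

print(raw.tolist())           # [1, -1, 1, 0]
print(codes.isna().tolist())  # [False, True, False, False]
```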