How do I deterministically convert Pandas string columns into specific numbers?
Question:
I asked a similar question here and was grateful for the community’s help.
I have a problem wherein I want to convert the strings in a dataframe column into numbers. This time I cannot map the strings to numbers manually, as the column is quite long in practice (the example below is just a minimal example). The constraint is that every time the same string is repeated, the number should be the same.
I tried using pd.to_numeric, but it gave me an error:
import pandas as pd
data = [['mechanical@engineer', 'Works on machines'], ['field engineer', 'Works on pumps'],
        ['lab_scientist', 'Publishes papers'], ['field engineer', 'Works on pumps'],
        ['lab_scientist', 'Publishes papers']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job1', 'Description'])
role_to_code = {"mechanical@engineer": 0, "field engineer": 1, "lab_scientist": 2}
df['Job1'] = df['Job1'].map(role_to_code)
print(df.head())
df['Description'] = pd.to_numeric(df['Description'])
Here is the error:
ValueError: Unable to parse string "Works on machines" at position 0
According to similar SO posts, the solution for the above error is to specify a separator. But as the dataset is quite big, I don’t want to specify multiple separators. Is there a way to automate this process?
Answers:
If you want to encode the strings in the ‘Description’ column as numbers, you can use scikit-learn’s LabelEncoder class, which will encode each unique string as a unique integer value.
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# Fit the encoder on the 'Description' column and transform the column
df['Description'] = le.fit_transform(df['Description'])
Note that LabelEncoder assigns the integers according to the sorted order of the unique strings, so the values only identify each string and don’t carry any other meaning.
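As a small sketch of how the encoding can be inspected and reversed, the snippet below rebuilds the 'Description' values from the question as a standalone Series. The names descriptions, code_lookup and decoded are only illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same 'Description' values as in the question, kept as a standalone Series.
descriptions = pd.Series([
    "Works on machines", "Works on pumps", "Publishes papers",
    "Works on pumps", "Publishes papers",
])

le = LabelEncoder()
codes = le.fit_transform(descriptions)

# classes_ holds the unique strings in sorted order;
# the code of a string is its position in that array.
code_lookup = {text: code for code, text in enumerate(le.classes_)}

# inverse_transform turns the integer codes back into the original strings.
decoded = le.inverse_transform(codes)

print(codes)
print(code_lookup)
```

Keeping the fitted encoder around (or the lookup built from classes_) is what lets the same string get the same number later on.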
If you want to map specific strings to specific integer values, you can use a dictionary and the map method, just like you did with the ‘Job1’ column:
description_to_code = {"Works on machines": 0, "Works on pumps": 1, "Publishes papers": 2}
df['Description'] = df['Description'].map(description_to_code)
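If typing the dictionary by hand is not practical, one possible approach is to build it from the column itself. This is only a sketch: the names codes and unseen are illustrative, and the numbering here follows the order of first appearance, which is an assumption about what is wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "Description": [
        "Works on machines", "Works on pumps", "Publishes papers",
        "Works on pumps", "Publishes papers",
    ]
})

# Build the lookup automatically: one integer per unique string,
# numbered in order of first appearance.
description_to_code = {
    text: code for code, text in enumerate(df["Description"].unique())
}

codes = df["Description"].map(description_to_code)

# Strings that are missing from the dictionary become NaN instead of raising,
# so it is worth checking for them when reusing the mapping on new data.
unseen = pd.Series(["Works on pumps", "Designs bridges"]).map(description_to_code)

print(codes.tolist())
print(unseen.isna().tolist())
```

Saving description_to_code (for example as JSON) and reusing it keeps the numbers stable across runs.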
IIUC, there is a pandas built-in for that: factorize.
pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)
Encode the object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
df["Description_new"] = pd.factorize(df['Description'])[0]
Output:
print(df)
   Job1        Description  Description_new
0     0  Works on machines                0
1     1     Works on pumps                1
2     2   Publishes papers                2
3     1     Works on pumps                1
4     2   Publishes papers                2
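factorize also returns the unique values, which can be used to decode the numbers again. The sketch below uses illustrative variable names and additionally shows sort=True, which numbers the strings by sorted order rather than by first appearance, so the result does not depend on the row order.

```python
import pandas as pd

descriptions = pd.Series([
    "Works on machines", "Works on pumps", "Publishes papers",
    "Works on pumps", "Publishes papers",
])

# Default behaviour: codes follow the order in which strings first appear.
codes, uniques = pd.factorize(descriptions)

# uniques is an Index of the distinct strings; indexing it with the codes
# rebuilds the original column.
decoded = list(uniques[codes])

# sort=True: codes follow the sorted order of the distinct strings,
# so reordering the rows does not change which number a string gets.
sorted_codes, sorted_uniques = pd.factorize(descriptions, sort=True)
reversed_codes, reversed_uniques = pd.factorize(descriptions[::-1], sort=True)

print(codes)
print(sorted_codes)
```

Note that even with sort=True the numbers can shift if new distinct strings show up later, so a saved mapping is safer when the codes must stay fixed across datasets.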