How do I deterministically convert Pandas string columns into specific numbers?

Question:

I asked a similar question here and was grateful for the community’s help.

I want to convert the strings in a DataFrame into numbers. This time I cannot map the strings to numbers by hand, because the column is quite long in practice (the example below is just a minimal example). The constraint is that every occurrence of the same string must map to the same number.

I tried using pd.to_numeric, but it raised an error:

import pandas as pd

# Create the pandas DataFrame
data = [['mechanical@engineer', 'Works on machines'], ['field engineer', 'Works on pumps'],
        ['lab_scientist', 'Publishes papers'], ['field engineer', 'Works on pumps'],
        ['lab_scientist', 'Publishes papers']]
df = pd.DataFrame(data, columns=['Job1', 'Description'])
role_to_code = {"mechanical@engineer": 0, "field engineer": 1, "lab_scientist": 2}

df['Job1'] = df['Job1'].map(role_to_code)

print(df.head())

df['Description'] = pd.to_numeric(df['Description'])

Here is the error:

ValueError: Unable to parse string "Works on machines" at position 0

The solution for the above error as per similar SO posts is to specify a separator. But as the dataset is quite big, I don’t want to specify multiple separators. Is there a way to automate this process?

Asked By: desert_ranger

Answers:

If you want to encode the strings in the ‘Description’ column as numbers, you can use scikit-learn’s LabelEncoder class, which will encode each unique string as a unique integer value.

from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Fit the encoder on the 'Description' column and transform the column
df['Description'] = le.fit_transform(df['Description'])

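Note that LabelEncoder assigns the integers in the sorted order of the unique strings, so the same set of strings always produces the same codes. Beyond that, the values don’t carry any specific meaning.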

If you want to map specific strings to specific integer values, you can use a dictionary and the map method, just like you did with the ‘Job1’ column:

description_to_code = {"Works on machines": 0, "Works on pumps": 1, "Publishes papers": 2}
df['Description'] = df['Description'].map(description_to_code)
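
When the column is too long to write the dictionary by hand, the dictionary itself can be generated from the data. The following is a minimal sketch, assuming it is acceptable to number the unique strings in sorted order; the column name Description_code is a hypothetical name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Description": [
        "Works on machines", "Works on pumps", "Publishes papers",
        "Works on pumps", "Publishes papers",
    ]
})

# Number the unique strings in sorted order, so the codes depend only on
# which strings are present and not on the order of the rows.
description_to_code = {
    text: code for code, text in enumerate(sorted(df["Description"].unique()))
}

# Description_code is a hypothetical column name used for illustration.
df["Description_code"] = df["Description"].map(description_to_code)
print(df)
```

Because the dictionary is kept, it can be reused with map on other columns or other DataFrames to guarantee that the same string always receives the same number.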
Answered By: Pablo Alaniz

IIUC, pandas has a built-in for this: factorize.

pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)
     Encode the object as an enumerated type or categorical variable.

     This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

df["Description_new"] = pd.factorize(df['Description'])[0]

Output:

print(df)

   Job1        Description  Description_new
0     0  Works on machines                0
1     1     Works on pumps                1
2     2   Publishes papers                2
3     1     Works on pumps                1
4     2   Publishes papers                2
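
factorize also returns the unique values, which is easy to miss when only the first element of the result is kept. As a small sketch of how that second element might be used, the example below decodes the integers back to strings and shows the sort=True variant; the variable names are illustrative only.

```python
import pandas as pd

descriptions = pd.Series([
    "Works on machines", "Works on pumps", "Publishes papers",
    "Works on pumps", "Publishes papers",
])

# By default, codes are assigned in order of first appearance.
codes, uniques = pd.factorize(descriptions)

# uniques[i] is the string that was assigned code i, so indexing it with
# the codes recovers the original values.
decoded = uniques[codes]

# With sort=True, codes follow the sorted order of the unique strings
# instead, which makes them independent of row order.
sorted_codes, sorted_uniques = pd.factorize(descriptions, sort=True)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```

The default ordering depends on which string appears first, so sort=True may be the safer choice when the same strings need the same codes across differently ordered data.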
Answered By: Timeless