Redefine categories of a categorical variable ignoring upper and lower case

Question

I have a dataset with a categorical variable that is not cleanly coded. The same category sometimes appears in upper case, sometimes in lower case, and sometimes in mixed-case variations. Since the dataset is large, I would like to harmonize the categories while taking advantage of the categorical dtype, which rules out any replace-based solution. The only solutions I found are this and this, but I feel they implicitly make use of replace.

I report a toy example below, together with the solutions I tried:

from pandas import Series

# Create dataset
df = Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"], dtype="category", name="NEW_TEST")

# Define the old, the "new" and the desired categories
original_categories = list(df.cat.categories)
standardised_categories = list(map(lambda x: x.lower(), df.cat.categories)) 
desired_new_cat = list(set(standardised_categories))

# Failed attempt to change categories   
df.cat.categories = standardised_categories
df = df.cat.rename_categories(standardised_categories)
# Error message: Categorical categories must be unique

Asked By: Daniele Mauriello

Answer 1

You shouldn’t try to harmonize after converting to category. Converting first defeats the purpose of a Categorical, because one category is created for each distinct string, case variants included.

You can instead harmonize the case with str.capitalize, then convert to categorical:

import pandas as pd

# Normalize the case first, then convert to categorical
s = (pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
               name="NEW_TEST")
       .str.capitalize().astype('category')
     )

If you already have a category, convert back to string and start over:

s = s.astype(str).str.capitalize().astype('category')

Output:

0      Male
1    Female
2      Male
3    Female
4      Male
5      Male
Name: NEW_TEST, dtype: category
Categories (2, object): ['Female', 'Male']
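
If the Series is already categorical and large, a possible alternative is to lower-case only the category labels instead of every row. The following is a minimal sketch, assuming a reasonably recent pandas and no missing values: Series.map on a categorical Series applies the function to the categories, and because the lowered labels are no longer unique the result comes back as plain object dtype, so it is converted to category again. The variable names are illustrative only.

```python
import pandas as pd

# Toy data from the question, already stored as a categorical Series
s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# map() on a categorical Series is applied to the categories (6 labels here),
# not to every row. The lowered labels collide, so the result is not
# categorical any more and has to be converted again.
harmonized = s.map(str.lower).astype("category")

print(harmonized.cat.categories.tolist())
```

This still rebuilds the categorical at the end, so it is a variation on the answer above rather than a true in-place rename.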

Answered By: mozway

Answer 2

Given the Series df that OP creates in the code sample shared in the question, one approach would be to use pandas.Series.str.lower followed by .astype("category"), as follows:

df = df.str.lower().astype("category")

[Out]:

0      male
1    female
2      male
3    female
4      male
5      male

If one prints the dtype, one gets

CategoricalDtype(categories=['female', 'male'], ordered=False)
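
As a quick sanity check, one can compare the number of categories before and after lower-casing. This is a small sketch that rebuilds the toy Series from the question under the hypothetical names s and lowered:

```python
import pandas as pd

# Toy data from the question: six spellings of two underlying categories
s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# Lower-case the values, then convert back to categorical
lowered = s.str.lower().astype("category")

# Inspect the resulting dtype and the category counts before and after
print(lowered.dtype)
print(len(s.cat.categories), len(lowered.cat.categories))
```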

Answered By: Gonçalo Peres
