Weird behavior on String vs Categorical Dtypes

Question:

I've been running into some very weird behavior, possibly a bug, when handling categorical vs. string dtypes. Take a look at this simple example dataframe:

import pandas as pd
import numpy as np
data = pd.DataFrame({
    'status' :  ['pending', 'pending','pending', 'canceled','canceled','canceled', 'confirmed', 'confirmed','confirmed'],
    'partner' :  ['A', np.nan,'C', 'A',np.nan,'C', 'A', np.nan,'C'],
    'product' : ['afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard'],
    'brand' : ['brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3'],
    'gmv' : [100,100,100,100,100,100,100,100,100]})

data = data.astype({'partner':'category','status':'category','product':'category', 'brand':'category'})

When I run a single loc selection:

test = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]

and this is the output

[image: first output]

Now I just convert my categorical columns to string (I am moving them back to string because of a bug with groupby on categorical columns, as described here):

data = data.astype({'partner':'string','status':'string','product':'string', 'brand':'string'})

And let's run the same loc command:

test2 = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]

but take a look at the output!

[image: second output]

I am really lost as to why it does not work. I've figured out that it is related to the categorical NaN being converted back to string, but I don't see why that would be a problem.

Asked By: FábioRB


Answers:

The problem is exactly the difference between how a missing value is represented in the category dtype and in the string dtype:

With category, the missing value is a regular float NaN, and the comparison yields a plain boolean, so:
data.partner != 'A' will be True for all the rows with NaN.

When you convert to string, the missing value becomes pd.NA (pandas._libs.missing.NAType), and comparing against it yields neither True nor False, so now:
data.partner != 'A' returns <NA> for those rows, which is not True; loc does not select them, and the result differs.
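The difference described above can be seen on a small Series. This is only a minimal sketch: the names partner_cat, partner_str, mask_cat and mask_str are made up for illustration, and the three-row column stands in for the partner column from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'partner' column from the question.
partner_cat = pd.Series(['A', np.nan, 'C'], dtype='category')
partner_str = partner_cat.astype('string')

# Category dtype: the missing row compares as "not equal", i.e. True.
mask_cat = partner_cat != 'A'

# String dtype: the missing row compares as <NA>, neither True nor False.
mask_str = partner_str != 'A'

print(mask_cat.tolist())
print(mask_str.isna().tolist())
```

The second print shows which positions of the string-dtype mask are missing; those are the rows that disappear from the loc selection.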

Basically, NaN in a category column is not a category itself, so it is handled differently. This is why you can't call fillna on a categorical column with a value that is not already one of its categories; you have to add that category first.
You can use something like this:

data['partner'] = data.partner.cat.add_categories('Not_available').fillna('Not_available')

to add a custom NA category and replace the missing values. Now if you convert to string and run the same conditions, you should get the same result.
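As a rough check of that claim, the sketch below applies the same idea to a tiny frame and compares the rows selected under both dtypes. The frame df and the variable names are hypothetical. The last part shows an alternative that is not from the answer above: keeping the missing values and stating explicitly how they should count in the mask with fillna(True).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the question's data.
df = pd.DataFrame({
    'partner': ['A', np.nan, 'C'],
    'gmv': [100, 100, 100],
})

# Approach from the answer: make "missing" a real category first.
cat = df['partner'].astype('category')
cat = cat.cat.add_categories('Not_available').fillna('Not_available')
as_string = cat.astype('string')

rows_cat = df.loc[cat != 'A'].index.tolist()
rows_str = df.loc[as_string != 'A'].index.tolist()

# Alternative (an assumption, not from the answer): keep the missing
# values and decide explicitly that they count as "not 'A'".
raw_string = df['partner'].astype('string')
explicit_mask = (raw_string != 'A').fillna(True)
rows_explicit = df.loc[explicit_mask].index.tolist()

print(rows_cat, rows_str, rows_explicit)
```

Which variant fits better depends on whether a placeholder such as 'Not_available' is acceptable in later steps like groupby; the fillna(True) variant leaves the data itself unchanged.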

Answered By: Yashar