Weird behavior on String vs Categorical Dtypes
Question:
I've been seeing some very weird behavior, possibly a bug, when handling categorical vs. string dtypes. Take a look at this simple example dataframe:
import pandas as pd
import numpy as np
data = pd.DataFrame({
'status' : ['pending', 'pending','pending', 'canceled','canceled','canceled', 'confirmed', 'confirmed','confirmed'],
'partner' : ['A', np.nan,'C', 'A',np.nan,'C', 'A', np.nan,'C'],
'product' : ['afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard'],
'brand' : ['brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3'],
'gmv' : [100,100,100,100,100,100,100,100,100]})
data = data.astype({'partner':'category','status':'category','product':'category', 'brand':'category'})
When I run a single loc selection:
test = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
and this is the output
Now I just convert my categorical columns to string (I am moving them back to string because of a bug with groupby on categoricals, as described here):
data = data.astype({'partner':'string','status':'string','product':'string', 'brand':'string'})
And let's run the same loc command:
test2 = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
but take a look at the output!
I am really lost as to why it does not work. I've figured out that it is something related to the NaN in the categorical columns being converted to strings, but I don't see why that would be a problem.
Answers:
The problem is exactly the difference between the NaN value in the category type and in the string type:
With category, the missing value is a regular float NaN and you can use it in a comparison, so:
data.partner !='A'
will be True for all the rows with NaN.
When converting to string, the missing value becomes pandas._libs.missing.NAType (that is, pd.NA), and a comparison against it yields neither True nor False, so now:
data.partner !='A'
returns <NA> for the rows with missing values, which is not True, so loc drops those rows and the result differs.
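The difference can be reproduced on a three-element Series. This is a minimal sketch; the names `as_category`, `as_string` and the mask variables are illustrative and not part of the question.

```python
import numpy as np
import pandas as pd

# Same shape of data as the 'partner' column, shortened to three rows.
partner = pd.Series(['A', np.nan, 'C'])

as_category = partner.astype('category')
as_string = partner.astype('string')

# Categorical: the missing value compares as "not equal", giving a plain bool mask.
mask_category = as_category != 'A'
print(mask_category.tolist())  # [False, True, True]

# String dtype: the comparison propagates pd.NA, giving a nullable boolean mask.
mask_string = as_string != 'A'
print(mask_string.dtype)       # boolean
print(mask_string.tolist())    # [False, <NA>, True]

# Boolean indexing treats <NA> as False, so the missing row is dropped.
kept_category = as_category[mask_category]
kept_string = as_string[mask_string]
print(len(kept_category), len(kept_string))  # 2 1
```
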
Basically, the NaN in a category column is not a category in itself, so it is handled differently. This is why you can't use fillna on a categorical column with a value that is not already one of its categories; you have to define a category for it first.
You can use something like this:
data['partner'] = data.partner.cat.add_categories('Not_available').fillna('Not_available')
to add a custom NA category and replace the missing values. Now if you convert to string and run the same conditions, you should get the same result.
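A short sketch of that workaround on a three-element Series follows. The variable names are illustrative. The last two lines show an alternative that is not from this answer: keeping the missing values and filling the mask explicitly instead.

```python
import numpy as np
import pandas as pd

partner = pd.Series(['A', np.nan, 'C'], dtype='category')

# Filling with a value that is not yet a category is rejected.
# The exception type has differed between pandas versions, so both are caught.
try:
    partner.fillna('Not_available')
    fill_failed = False
except (TypeError, ValueError):
    fill_failed = True

# Register the placeholder as a category first, then fill.
filled = partner.cat.add_categories('Not_available').fillna('Not_available')

# After converting to string there is no missing value left to propagate.
as_string = filled.astype('string')
mask = as_string != 'A'
kept = as_string[mask]
print(kept.tolist())  # ['Not_available', 'C']

# Alternative (an assumption, not from the answer): keep the missing values
# and state explicitly how they should count in the mask.
raw_mask = partner.astype('string') != 'A'
explicit_mask = raw_mask.fillna(True)
print(explicit_mask.tolist())  # [False, True, True]
```

The placeholder approach changes the data, while filling the mask leaves the column untouched; which one fits depends on whether the missing values should stay missing downstream.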