Why does the Pandas fillna method not work with pd.NA values?

Question:

According to the reference page of pandas.DataFrame.fillna, all NA/NaN values are filled using the specified method.

However, in the presence of pd.NA values it does not seem to work.

As the following code block shows, replacing missing booleans (marked with pd.NA) with the column's mode leaves them unfilled:

import pandas as pd
import numpy as np

# create dataframe
df = pd.DataFrame({"a": [True, pd.NA, False, True], "b": [0, np.nan, 2, 3]})

# convert types (a becomes boolean, b becomes Int64)
df = df.convert_dtypes()

# get boolean columns
bool_cols = df.select_dtypes(include=bool).columns.tolist()

# get most frequent values
most_frequent_values = df[bool_cols].mode()

# replace missing content with column's mode
df[bool_cols] = df[bool_cols].fillna(most_frequent_values)

# print
print(df)

This is the current output:

id      a     b
0    True     0
1    <NA>  <NA>
2   False     2
3    True     3

while this is the expected output:

id      a     b
0    True     0
1    True  <NA>
2   False     2
3    True     3

What am I missing? Should I convert all pd.NA values to NaN?

Side note: My Pandas version is 1.5.2

Asked By: Flavio


Answers:

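The issue is not pd.NA itself: mode doesn't return a single value per column but a 2D output (a DataFrame with one row per mode). When fillna receives a DataFrame, it aligns it on both index and columns, so only a missing value at index 0 could be filled, and the pd.NA at index 1 is left untouched.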
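
A minimal sketch of why the 2D result matters, using column a from the question. The variable names modes, aligned, first_mode and filled are illustrative and not from the original post. It assumes that fillna aligns a DataFrame argument on both index and columns, which is consistent with the output shown in the question:

```python
import pandas as pd

# Same boolean column as in the question, converted to the nullable dtype.
df = pd.DataFrame({"a": [True, pd.NA, False, True]}).convert_dtypes()

# mode() returns a DataFrame (one row per mode), not a scalar per column.
modes = df.mode()
print(type(modes).__name__, modes.shape)

# Filling with a DataFrame aligns on index and columns: the only fill value
# sits at index label 0, so the missing value at index 1 is not replaced.
aligned = df.fillna(modes)
print(aligned["a"].isna().sum())

# A Series indexed by column name is applied per column instead.
first_mode = modes.loc[0]
filled = df.fillna(first_mode)
print(filled)
```

Passing a dict such as {"a": True} to fillna would be matched per column in the same way as the Series.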

Take the first row of the mode result, so that fillna receives one value per column:

most_frequent_values = df[bool_cols].mode().loc[0] # take the first mode

# then fillna
df[bool_cols] = df[bool_cols].fillna(most_frequent_values)

Then the output is correct:

       a     b
0   True     0
1   True  <NA>
2  False     2
3   True     3
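
The "take the first mode" comment matters when a column has ties, because mode then returns more than one row. The sketch below is an illustration only: column c and its values are hypothetical and not part of the question. Which of the tied values lands in the first row is decided by pandas, so the example does not depend on it:

```python
import pandas as pd

# Column "a" comes from the question; column "c" is a made-up example in
# which True and False each occur once, so it has two modes.
df = pd.DataFrame(
    {
        "a": [True, pd.NA, False, True],
        "c": [True, False, pd.NA, pd.NA],
    }
).convert_dtypes()

# Two rows: "c" has two modes, and "a" is padded with a missing value.
modes = df.mode()
print(modes)

# Taking the first row still yields exactly one fill value per column.
first_mode = modes.loc[0]
filled = df.fillna(first_mode)
print(filled)
```

If a specific tie-breaking rule is needed, the fill value for that column would have to be chosen explicitly rather than taken from the first row.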
Answered By: mozway