Pandas apply casts None dtype to object or float depending on other outputs

Question:

I would like to control the output dtypes for apply on a row. foo and bar below have multiple outputs.

import pandas as pd

def foo(x):
    return x['a'] * x['b'], None, x['a'] > x['b']

def bar(x):
    return x['a'] * x['b'], None

df = pd.DataFrame([{'a': 10, 'b': 2}, {'a': 10, 'b': 20}])
df2 = df.copy()
df[['product', 'dummy', 'greater']] = df.apply(foo, axis=1, result_type='expand')
df2[['product', 'dummy']] = df2.apply(bar, axis=1, result_type='expand')

The output dtypes are:

col      df      df2
a        int64   int64
b        int64   int64
product  int64   float64
dummy    object  float64
greater  bool

A comment on the question "pandas apply changing dtype" suggests that apply returns a Series with a single dtype. That may be the case for bar, since its outputs can all be cast to float. But it doesn’t seem to be the case for foo, because then all of its outputs would need to be object.

Is it possible to control the output dtypes of apply? I.e. get/specify the output dtypes (int, object) for bar, or do I need to cast the dtype at the end?

Background:
I have a dataframe where the dummy column has values True, False and None and dtype ‘object’. The apply function runs on some corner cases, and introduces NaN instead of None. I’m replacing the NaN with None after apply, but it seems overly complicated.

pandas version 1.5.2

Asked By: Frank_Coumans

Answers:

IIUC, you’re asking why product and dummy have different dtypes after applying foo and bar, even though the values those functions return for these columns are the same?

       col      df      df2
0        a   int64    int64
1        b   int64    int64
2  product   int64  float64  # int64  <> float64
3    dummy  object  float64  # object <> float64
4  greater    bool         

If so, that’s because with result_type="expand", pandas post-processes the per-row results behind the scenes in infer_to_same_shape, which is roughly equivalent to this:

_datafoo = {0: (20, None, True), 1: (200, None, False)}
_databar = {0: (20, None), 1: (200, None)}

expandfoo = pd.DataFrame(_datafoo).T.set_axis(df.index).infer_objects()
expandbar = pd.DataFrame(_databar).T.set_axis(df.index).infer_objects()

Output (foo) :

print(expandfoo.T, expandfoo, expandfoo.dtypes.to_dict(), sep="\n"*2)

      0      1
0    20    200
1  None   None
2  True  False

     0     1      2
0   20  None   True
1  200  None  False

{0: dtype('int64'), 1: dtype('O'), 2: dtype('bool')}

Output (bar) :

print(expandbar.T, expandbar, expandbar.dtypes.to_dict(), sep="\n"*2)

      0      1
0  20.0  200.0
1   NaN    NaN  # <-- see the presence of NaN

       0   1
0   20.0 NaN
1  200.0 NaN

{0: dtype('float64'), 1: dtype('float64')}

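As you can see, both columns of expandbar end up as float64. Each row result of bar, e.g. (20, None), contains only a number and None, so the intermediate column is built as float64 with None converted to NaN. infer_objects only converts object columns, so it leaves those float64 columns as they are (if this is unintuitive, see GH28318). With foo, the extra boolean makes each row result object dtype, so None survives and infer_objects can later infer a separate dtype per column.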
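To check where the None values turn into NaN, here is a minimal sketch that inspects the intermediate frame before the transpose. The names bar_cols and foo_cols are only illustrative, and the data is the same hand-written approximation as above, not what pandas literally stores internally:

```python
import pandas as pd

# One column per row result, i.e. the frame before the transpose.
bar_cols = pd.DataFrame({0: (20, None), 1: (200, None)})
foo_cols = pd.DataFrame({0: (20, None, True), 1: (200, None, False)})

# bar: a number plus None only, so each column is float64 and None becomes NaN.
print(bar_cols.dtypes.to_dict())
# foo: the extra bool forces object dtype, so None is kept as None.
print(foo_cols.dtypes.to_dict())

# infer_objects only converts object columns: float64 stays float64,
# while the object frame gets one inferred dtype per column.
print(bar_cols.T.infer_objects().dtypes.to_dict())
print(foo_cols.T.infer_objects().dtypes.to_dict())
```

So the float64 result for bar is already decided when the per-row columns are built, before infer_objects runs.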


Is it possible to control the output dtypes of apply?

Only indirectly: the resulting dtypes depend on the computation done by the applied function and on the values it returns. If you need specific dtypes, you can always cast explicitly with convert_dtypes or astype at the end.
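As an illustration of casting at the end, here is a minimal sketch for bar from the question. It assumes product never contains missing values (otherwise the cast to int64 would fail) and that dummy should hold None rather than NaN. The variable name out is just for this example:

```python
import pandas as pd

def bar(x):
    return x['a'] * x['b'], None

df = pd.DataFrame([{'a': 10, 'b': 2}, {'a': 10, 'b': 20}])
out = df.apply(bar, axis=1, result_type='expand')

# product: no missing values here, so casting back to int64 is safe.
df['product'] = out[0].astype('int64')

# dummy: rebuild the column as object dtype, swapping NaN back to None.
df['dummy'] = pd.Series(
    [None if pd.isna(v) else v for v in out[1]],
    index=out.index,
    dtype=object,
)

print(df.dtypes)
```

Passing dtype=object explicitly keeps pandas from re-inferring a numeric dtype for the rebuilt column.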

Answered By: Timeless