dropna=True behaviour in pandas.DataFrame.groupby and pandas.DataFrame.pivot_table

Question:

I’m trying to understand the behaviour of the pandas.DataFrame.groupby and pandas.DataFrame.pivot_table methods, and I’ve come across a difference that I can’t explain myself.

It seems that specifying dropna=True (the default for both) has different consequences in the two cases, which may be hinted at by the different descriptions given in the docs.

For pandas.DataFrame.pivot_table:

dropna: bool, default True

Do not include columns whose entries are all NaN.

For pandas.DataFrame.groupby:

dropna: bool, default True

If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

That said, while the description given for .pivot_table() makes sense to me in light of the example below, I can’t work out the nuances of the dropna behaviour in .groupby().

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [31, np.nan, 28, 22, 54, np.nan, 49, 60, 25, np.nan],
    'country_live': ['Italy', 'Spain', 'Italy', 'Spain', 'France', 'Italy', 'Spain', 'Spain', 'France', 'Spain'],
    'employment_status': ['Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Working student', 'Working student', 'Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Fully employed by a company / organization', 'Fully employed by a company / organization', 'Working student',
    'Partially employed by a company / organization']
    },
)

df = df.assign(age=lambda t: t['age'].astype('Int64'), 
    country_live=lambda t: t['country_live'].astype('category'), 
    employment_status=lambda t: t['employment_status'].astype('category'))

With .pivot_table():

df.pivot_table(index='country_live', columns='employment_status', values='age', aggfunc='mean', dropna=True)

[Image: resulting table]

With .groupby() I’d instead get (while expecting the same result obtained above):

df.groupby(by=['country_live', 'employment_status'], dropna=True)['age'] \
    .mean() \
    .unstack()

[Image: resulting table]

Can someone explain the reason(s) why the two do not work the same (thus implicitly explaining the behaviour of dropna in .groupby())?

Asked By: amiola


Answers:

The main difference is that for .groupby() the dropna=True refers to the groups you are creating, NOT to the values. In fact if you add a row to your df:

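row = {'age': 50, 'country_live': np.nan, 'employment_status': 'Partially employed by a company / organization'}

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)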

the output of the pivot table does not change when you switch the value of dropna (the nan group does not appear in the index either way).

The situation changes for groupby: with dropna=True you get the same result as before, while with dropna=False the nan group is added.
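
To make the keys-versus-values distinction concrete, here is a minimal sketch on a made-up frame. The names toy, pivoted, grouped and trimmed, and the shortened status labels, are illustrative assumptions and not taken from the question. No grouping key is missing here, only some age values are, so dropna in groupby has nothing to drop, whereas pivot_table removes the column whose entries are all NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every key is present, but the "Partial" ages are all missing.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "Spain", "Spain"],
    "status": ["Full", "Partial", "Full", "Partial"],
    "age": [31.0, np.nan, 49.0, np.nan],
})

# pivot_table: dropna=True drops columns whose entries are all NaN.
pivoted = toy.pivot_table(
    index="country", columns="status", values="age", aggfunc="mean", dropna=True
)

# groupby: dropna=True only concerns NaN in the group keys, not in the values,
# so the all-NaN "Partial" column survives the unstack.
grouped = toy.groupby(["country", "status"], dropna=True)["age"].mean().unstack()

# Dropping all-NaN columns by hand brings the groupby result in line with the pivot.
trimmed = grouped.dropna(axis=1, how="all")

print(list(pivoted.columns))
print(list(grouped.columns))
print(list(trimmed.columns))
```

If this reading is right, the only step separating the two results in the question is the removal of the all-NaN column, which groupby leaves to the caller.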

Answered By: imburningbabe

Replying to your comment here since code formatting is a pain in comments.

I don’t know exactly how you tried dropna=False for groupby, but running the following code will show the group with nan value for country_live:

df = pd.DataFrame({
    'age': [31, np.nan, 28, 22, 54, np.nan, 49, 60, 25, np.nan],
    'country_live': ['Italy', pd.NA, 'Italy', 'Spain', 'France', 'Italy', 'Spain', 'Spain', 'France', 'Spain'],
    'employment_status': ['Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Working student', 'Working student', 'Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Fully employed by a company / organization', 'Fully employed by a company / organization', 'Working student',
    'Partially employed by a company / organization']
    },
)

df = df.assign(age=lambda t: t['age'].astype('Int64'), 
    country_live=lambda t: t['country_live'].astype('category'), 
    employment_status=lambda t: t['employment_status'].astype('category'))

for gp, sub_df in df.groupby(by=['country_live', 'employment_status'], dropna=False):
    print(gp, sub_df, "\n", sep="\n")

Output (see last lines):

('France', 'Fully employed by a company / organization')
   age country_live                           employment_status
4   54       France  Fully employed by a company / organization


('France', 'Working student')
   age country_live employment_status
8   25       France   Working student


('Italy', 'Fully employed by a company / organization')
   age country_live                           employment_status
0   31        Italy  Fully employed by a company / organization


('Italy', 'Partially employed by a company / organization')
    age country_live                               employment_status
5  <NA>        Italy  Partially employed by a company / organization


('Italy', 'Working student')
   age country_live employment_status
2   28        Italy   Working student


('Spain', 'Fully employed by a company / organization')
   age country_live                           employment_status
6   49        Spain  Fully employed by a company / organization
7   60        Spain  Fully employed by a company / organization


('Spain', 'Partially employed by a company / organization')
    age country_live                               employment_status
9  <NA>        Spain  Partially employed by a company / organization


('Spain', 'Working student')
   age country_live employment_status
3   22        Spain   Working student


(nan, 'Partially employed by a company / organization')
    age country_live                               employment_status
1  <NA>          NaN  Partially employed by a company / organization

Conversely, the nan group will be ignored if you set dropna=True.
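
A smaller sketch of the same effect, on a hypothetical single-key frame (the names toy, dropped and kept are assumptions for illustration), shows both settings side by side. Note that a group whose values are all missing is kept in both cases; only a missing key is affected by dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has a missing key, another has a missing value.
toy = pd.DataFrame({
    "country": ["Italy", "Spain", np.nan, "Italy"],
    "age": [31.0, np.nan, 50.0, 28.0],
})

# dropna=True: the row whose key is NaN is excluded from the grouping.
dropped = toy.groupby("country", dropna=True)["age"].mean()

# dropna=False: NaN is treated as a group key of its own.
kept = toy.groupby("country", dropna=False)["age"].mean()

print(dropped)
print(kept)
```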

Answered By: Tranbi