How can we make pandas' default handling of missing values warn of their presence rather than silently ignore them?

Question:

As discussed here, pandas silently replaces NaN values with 0 when calculating sums, in contrast to explicit calculations as shown here:

import pandas as pd
import numpy  as np

np.nan + np.nan                              # Result: nan
pd.DataFrame([np.nan, np.nan]).sum().item()  # Result: 0.0

pandas’ descriptive statistics methods have a skipna argument. However, skipna defaults to True, which hides the presence of missing values from casual users and novice programmers.

This creates a risk that analyses will be "…quietly, accidentally wrong since their Pandas operators haven’t used the correct skipna".

In Python, is there a way for users to set skipna=False as the default option?

Asked By: David Lovell


Answers:

This behavior is described in the documentation:

skipna (bool, default True) – Exclude NA/null values when computing the result.

The skipna parameter of the pd.DataFrame.sum() method defaults to True. So when you sum a column, the NaN values are skipped, and a column that contains only NaN values sums to 0.

If you set it to False, you get the intended behavior. However, there is no way to make False the default. You have to pass skipna=False on every call, unless you define your own wrapper around the method.

import pandas as pd
import numpy  as np

np.nan + np.nan                                   # nan
pd.DataFrame([np.nan, np.nan]).sum(skipna=False)
# 0   NaN
# dtype: float64
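
As a small illustration of the difference, the sketch below compares the default with skipna=False on a column that has one missing value, and shows two ways to make missing data visible without changing any defaults: checking isna().any() before aggregating, and passing min_count=1 so that an all-NaN sum returns NaN instead of 0. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total_skipping = s.sum()              # NaN is skipped, so this is 1.0 + 3.0
total_strict = s.sum(skipna=False)    # NaN propagates into the result

# Explicit check for missing values before aggregating
has_missing = s.isna().any()

# An all-NaN column sums to 0.0 by default; min_count=1 requires at least
# one valid value, otherwise the result is NaN
all_missing = pd.Series([np.nan, np.nan])
empty_sum = all_missing.sum()
guarded_sum = all_missing.sum(min_count=1)

print(total_skipping, total_strict, has_missing, empty_sum, guarded_sum)
```

Note that min_count only guards the case where too few valid values exist. A column with some valid values and some NaN still needs skipna=False or an explicit isna() check.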

Here is a wrapper that can be defined to set your parameters to a custom value globally. This is code from this excellent SO answer.

## Function from - 
## https://stackoverflow.com/questions/55877832/setting-pandas-global-default-for-skipna-to-false

def set_default(func, **default):
    def inner(*args, **kwargs):
        kwargs.update(default)        # Update function kwargs w/ decorator defaults
        return func(*args, **kwargs)  # Call function w/ updated kwargs
    return inner                      # Return decorated function

pd.DataFrame.sum = set_default(pd.DataFrame.sum, skipna=False)
pd.DataFrame([np.nan, np.nan]).sum()
# 0   NaN
# dtype: float64
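
One thing to be aware of: because kwargs.update(default) runs after the caller's keywords are collected, the wrapper above applies skipna=False even when a caller explicitly passes skipna=True. A possible variant, sketched below, lets explicit keywords win and preserves the wrapped method's name and docstring with functools.wraps. The name with_defaults and the choice of which methods to patch are assumptions for illustration, not part of the linked answer.

```python
import functools

import numpy as np
import pandas as pd


def with_defaults(func, **defaults):
    """Wrap func so that `defaults` apply only when the caller omits them."""
    @functools.wraps(func)  # keep the original __name__ and docstring
    def inner(*args, **kwargs):
        merged = {**defaults, **kwargs}  # explicit caller keywords take precedence
        return func(*args, **merged)
    return inner


# Illustrative selection of reductions to patch; extend or trim as needed
for cls in (pd.Series, pd.DataFrame):
    for name in ("sum", "mean"):
        setattr(cls, name, with_defaults(getattr(cls, name), skipna=False))

df = pd.DataFrame({"a": [1.0, np.nan]})
strict = df["a"].sum()              # new default: NaN propagates
lenient = df["a"].sum(skipna=True)  # the caller can still opt back in
print(strict, lenient)
```

This is still monkeypatching, so it changes behavior for any code in the same process that calls these methods without passing skipna, and a call that passes skipna positionally would raise a TypeError for a duplicated argument. It is probably best limited to interactive sessions or a clearly marked project setup module.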
Answered By: Akshay Sehgal