How to replace outliers with the median in a pandas DataFrame?

Question:

Here’s my dataframe:

cars_num_df.head(10)

    mpg cylinders   displacement    horsepower  weight  acceleration    age
0   18.0    8          307.0          130.0     3504.0     12.0         13
1   15.0    8          350.0          165.0     3693.0     11.5         13
2   18.0    8          318.0          150.0     3436.0     11.0         13
3   16.0    8          304.0          150.0     3433.0     12.0         13
4   17.0    8          302.0          140.0     3449.0     10.5         13
5   15.0    8          429.0          198.0     4341.0     10.0         13
6   14.0    8          454.0          220.0     4354.0      9.0         13
7   14.0    8          440.0          215.0     4312.0      8.5         13
8   14.0    8          455.0          225.0     4425.0     10.0         13
9   15.0    8          390.0          190.0     3850.0      8.5         13

Later on, I standardized the data using z-scores, and now I want to REPLACE the outliers (not remove them) with the median value of each column.

I tried doing this:

median = cars_numz_df.median()
std = cars_numz_df.std()
value = cars_numz_df

outliers = (value - median).abs() > 2*std

cars_numz_df[outliers] = cars_numz_df[outliers].abs()

cars_numz_df[outliers]


    mpg cylinders   displacement    horsepower  weight  acceleration    age
0   NaN 1.498191    NaN             NaN         NaN     NaN             NaN
1   NaN 1.498191    NaN             NaN         NaN     NaN             NaN
2   NaN 1.498191    NaN             NaN         NaN     NaN             NaN
3   NaN 1.498191    NaN             NaN         NaN     NaN             NaN
4   NaN 1.498191    NaN             NaN         NaN     NaN             NaN
5   NaN 1.498191    2.262118        2.454408    NaN     NaN             NaN
6   NaN 1.498191    2.502182        3.030708    NaN     2.384735        NaN
7   NaN 1.498191    2.367746        2.899730    NaN     2.566274        NaN
8   NaN 1.498191    2.511784        3.161685    NaN     NaN             NaN
9   NaN 1.498191    1.887617        2.244844    NaN     2.566274        NaN

Now, I’m trying to replace the outliers with the median by doing this:

cars_numz_df[outliers] = median

but I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-394-d48a51500f28> in <module>
      9 cars_numz_df[outliers] = cars_numz_df[outliers].abs()
     10 
---> 11 cars_numz_df[outliers] = median
     12 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3112 
   3113         if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
-> 3114             self._setitem_frame(key, value)
   3115         elif isinstance(key, (Series, np.ndarray, list, Index)):
   3116             self._setitem_array(key, value)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_frame(self, key, value)
   3161         self._check_inplace_setting(value)
   3162         self._check_setitem_copy()
-> 3163         self._where(-key, value, inplace=True)
   3164 
   3165     def _ensure_valid_index(self, value):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
   7543 
   7544                 _, other = self.align(other, join='left', axis=axis,
-> 7545                                       level=level, fill_value=np.nan)
   7546 
   7547                 # if we are NOT aligned, raise as we cannot where index

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
   3548                                             method=method, limit=limit,
   3549                                             fill_axis=fill_axis,
-> 3550                                             broadcast_axis=broadcast_axis)
   3551 
   3552     @Appender(_shared_docs['reindex'] % _shared_doc_kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
   7370                                       copy=copy, fill_value=fill_value,
   7371                                       method=method, limit=limit,
-> 7372                                       fill_axis=fill_axis)
   7373         else:  # pragma: no cover
   7374             raise TypeError('unsupported type: %s' % type(other))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _align_series(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis)
   7469                     fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
   7470             else:
-> 7471                 raise ValueError('Must specify axis=0 or 1')
   7472 
   7473             if copy and fdata is self._data:

ValueError: Must specify axis=0 or 1

Please advise, how can I replace the outliers with column median.

Asked By: morelloking

Answers:

I don’t have access to the dataset in the question, so I constructed a random dataset instead.

import pandas as pd
import random as r
import numpy as np

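# 100 random values in [0, 1000)
d = [r.random()*1000 for i in range(0,100)]
df = pd.DataFrame({'Values': d})

median = df['Values'].median()
std = df['Values'].std()
# flag values more than one standard deviation from the median
outliers = (df['Values'] - median).abs() > std
# overwrite only the flagged cells in this column with the median
df.loc[outliers, 'Values'] = median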

FWIW, clipping and winsorization should also be considered when trying to scoot outliers to somewhere useful.
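
For anyone unsure what clipping and winsorization look like in practice, here is a minimal sketch on the same kind of random data. The factor k = 1.5 and the 5th/95th percentile cut-offs are arbitrary assumptions for illustration, not recommendations; both variants use Series.clip so that extreme values are pulled in to a bound instead of being replaced by the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Values": rng.random(100) * 1000})

# Clipping: cap values at median +/- k standard deviations.
k = 1.5  # arbitrary choice for this sketch
median = df["Values"].median()
std = df["Values"].std()
clipped = df["Values"].clip(lower=median - k * std, upper=median + k * std)

# Winsorizing: cap values at chosen percentiles (here the 5th and 95th),
# so anything beyond a cut-off takes the cut-off value itself.
low, high = df["Values"].quantile([0.05, 0.95])
winsorized = df["Values"].clip(lower=low, upper=high)
```

Unlike median replacement, both approaches keep the ordering of the data: a capped value is still at the edge of the distribution rather than moved to its centre.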

Answered By: Rich Andrews

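In your example, outliers is a boolean DataFrame, so it can be used as a mask. The median is a Series indexed by column name, which is why pandas asks for an axis: pass axis=1 to align it with the columns.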

cars_numz_df.mask(outliers, other=median, axis=1, inplace=True)

Shown with another example:

import numpy as np
import pandas as pd

np.random.seed(0) # seed random
df = pd.DataFrame(np.random.rand(10,2)) # 2col dataframe

median = df.median() # column medians, roughly 0.558 and 0.681
std = df.std()
value = df

outliers = (value-median).abs() > 2*std

df.mask(outliers, other=median, axis=1, inplace=True)
df
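
To tie this back to the question's workflow (standardize first, then replace), here is a rough end-to-end sketch. The column names and the normally distributed data are made-up stand-ins for the real dataset, and it uses the non-inplace form of mask so the standardized frame is left untouched.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the question's numeric columns.
raw = pd.DataFrame({
    "mpg": rng.normal(23, 8, 200),
    "weight": rng.normal(3000, 850, 200),
})

# Column-wise z-score standardization.
z = (raw - raw.mean()) / raw.std()

median = z.median()                           # Series indexed by column name
outliers = (z - median).abs() > 2 * z.std()   # boolean DataFrame, same shape as z

# axis=1 aligns the median Series with the columns, so each flagged cell
# receives the median of its own column.
cleaned = z.mask(outliers, other=median, axis=1)
```

The 2-standard-deviation threshold is taken from the question; any other multiple works the same way.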
Answered By: radarN

@Rich Andrews's answer hard-codes the threshold at one standard deviation. Here is an expanded version, wrapped in a function, with a z_thresh parameter that sets how many standard deviations from the median a value may lie before it is replaced:

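def replace_numerical_outliers(df, column_name, z_thresh=3):
    # replace, in place, values more than z_thresh standard deviations from the median
    median = df[column_name].median()
    std = df[column_name].std()
    outliers = (df[column_name] - median).abs() > z_thresh*std
    # assign only within column_name, so the other columns of flagged rows are untouched
    df.loc[outliers, column_name] = median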
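
As a usage illustration, here is a small self-contained sketch of the same idea applied to a two-column frame. The helper name replace_outliers_with_median and the toy values are hypothetical, and this variant returns a copy instead of modifying its argument.

```python
import pandas as pd

def replace_outliers_with_median(df, column_name, z_thresh=3):
    """Return a copy of df with outliers in one column set to that column's median."""
    out = df.copy()
    col = out[column_name]
    median = col.median()
    outliers = (col - median).abs() > z_thresh * col.std()
    # Only the flagged cells of column_name change; other columns are kept.
    out.loc[outliers, column_name] = median
    return out

# Toy data: the last horsepower value is an obvious outlier.
df = pd.DataFrame({
    "horsepower": [130.0, 165.0, 150.0, 150.0, 140.0, 900.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0, 3449.0, 4341.0],
})
result = replace_outliers_with_median(df, "horsepower", z_thresh=2)
```

Note that z_thresh=2 is used here on purpose: in such a small sample the outlier inflates the standard deviation enough that the default of 3 would not flag it.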
Answered By: Caridorc