How to remove Outliers in Python?
Question:
I want to remove outliers from my dataset “train”, and I’ve decided to use either z-score or IQR for this.
I’m running Jupyter notebook on Microsoft Python Client for SQL Server.
I’ve tried for z-score:
import numpy as np
from scipy import stats
train[(np.abs(stats.zscore(train)) < 3).all(axis=1)]
for IQR:
Q1 = train.quantile(0.02)
Q3 = train.quantile(0.98)
IQR = Q3 - Q1
train = train[~((train < (Q1 - 1.5 * IQR)) | (train > (Q3 + 1.5 * IQR))).any(axis=1)]
…which returns…
for z-score:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
for IQR:
TypeError: unorderable types: str() < float()
My train dataset looks like:
# Number of each type of column
print('Training data shape: ', train.shape)
train.dtypes.value_counts()
Training data shape:  (300000, 111)
int32      66
float64    30
object     15
dtype: int64
Help would be appreciated.
Answers:
Your code fails because you’re trying to calculate the z-score (and the quantile comparisons) on categorical columns, which hold strings.
To avoid this, first split train into its numerical and categorical parts:
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])
and only then compute a boolean index of the rows to keep (note the np.abs, so that large negative z-scores count as outliers too):
idx = (np.abs(stats.zscore(num_train)) < 3).all(axis=1)
and finally add the two pieces together:
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
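As a minimal, self-contained sketch of the z-score filter, here is the same idea on a small made-up frame. The toy data and the column names a, b and label are assumptions for illustration only, not taken from the question's dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 19 ordinary rows plus one extreme value in "a".
train = pd.DataFrame({
    "a": [1.0] * 10 + [2.0] * 9 + [100.0],
    "b": list(range(20)),
    "label": ["x", "y"] * 10,  # categorical column, excluded from the z-score
})

# Only numeric columns can be standardized.
num_train = train.select_dtypes(include=["number"])

# Absolute z-scores, so both very large and very small values are flagged.
z = np.abs(stats.zscore(num_train.to_numpy()))

# Keep a row only if every numeric value is within 3 standard deviations.
idx = (z < 3).all(axis=1)

# Indexing the original frame keeps the original column order.
train_cleaned = train.loc[idx]
print(train.shape, "->", train_cleaned.shape)
```

Indexing train directly with the mask gives the same rows as concatenating the numeric and categorical pieces, but leaves the columns in their original order.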
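For the IQR part (keeping the 0.02 and 0.98 quantiles from your question; the interquartile range proper uses 0.25 and 0.75):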
Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
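The quantile-range filter can be wrapped in a small helper to compare cut-offs. This is only a sketch: the function name iqr_mask, its parameters and the toy data are hypothetical, and the defaults of 0.25 and 0.75 are the conventional quartiles rather than the values used above.

```python
import pandas as pd

def iqr_mask(df, lower_q=0.25, upper_q=0.75, k=1.5):
    """Return a boolean Series that is True for rows with no numeric outlier."""
    num = df.select_dtypes(include=["number"])
    q_low = num.quantile(lower_q)
    q_high = num.quantile(upper_q)
    spread = q_high - q_low
    # A value is an outlier if it lies more than k * spread outside the quantiles.
    outlier = (num < (q_low - k * spread)) | (num > (q_high + k * spread))
    return ~outlier.any(axis=1)

# Hypothetical toy data with one extreme value.
train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 500.0],
    "label": list("abcdefghij"),
})

keep_quartiles = iqr_mask(train)                           # 0.25 / 0.75
keep_wide = iqr_mask(train, lower_q=0.02, upper_q=0.98)    # as in the thread
train_cleaned = train.loc[keep_quartiles]
print(keep_quartiles.sum(), keep_wide.sum())
```

On this toy frame the quartile-based fences drop the row with 500, while the 0.02/0.98 version keeps it, because the extreme value itself stretches the upper quantile. Which cut-off suits the real data is a judgment call.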
Please let us know if you have any further questions.
PS
You might also consider pandas.DataFrame.clip, which caps each outlying value at a chosen bound instead of dropping the whole row.
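A rough sketch of the clipping approach follows. The bounds here are per-column 0.02 and 0.98 quantiles, which is just one possible choice, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one extreme value.
train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 500.0],
    "label": list("abcdefghij"),
})

num_cols = train.select_dtypes(include=["number"]).columns

# Per-column bounds (assumed choice: 2nd and 98th percentiles).
lower = train[num_cols].quantile(0.02)
upper = train[num_cols].quantile(0.98)

# axis=1 aligns the bound Series with the column labels.
train_clipped = train.copy()
train_clipped[num_cols] = train[num_cols].clip(lower=lower, upper=upper, axis=1)
print(train_clipped["a"].min(), train_clipped["a"].max())
```

No rows are removed: values outside the bounds are replaced by the bound itself, and the categorical column is left untouched.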
You can also use the autooptimizer module:
pip install autooptimizer
from autooptimizer.process import outlier_removal