How do I remove outliers from a column in a dataframe?

Question:

The solutions I found online only show removing outliers from the entire dataframe, not just a specific column. So I’m having trouble figuring out how to perform outlier removal on a single column.

I tried creating a method, the code is shown below.

def find_outlier(df, column):
    # Find first and third quartile
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    
    # Find interquartile range
    IQR = q3 - q1
    
    # Find lower and upper bound
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    
    # Remove outliers
    df[column] = df[column][df[column] > lower_bound]
    df[column] = df[column][df[column] < upper_bound]
    
    return df

But when I ran the code, it said "Columns must be same length as key".

The code I used to run is shown below.

df['no_of_trainings'] = find_outlier(df, 'no_of_trainings')

Any help is appreciated.

Answers:

The comparison result is by-index, so you can use it to reduce the DataFrame

    df = df[df[column] > lower_bound]
    df = df[df[column] < upper_bound]
    return df

more concisely

    ...
    return df[(df[column] > lower_bound) & (df[column] < upper_bound)]
Answered By: ti7

There is no problem with your find_outlier code except for the return statement:

Should be

return df[column]

You code will replace outliers with NaN values.

Example code:

import pandas as pd

def find_outlier(df, column):
    # Find first and third quartile
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    
    # Find interquartile range
    IQR = q3 - q1
    
    # Find lower and upper bound
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    
    # Remove outliers
    df[column] = df[column][df[column] > lower_bound]
    df[column] = df[column][df[column] < upper_bound]
    
    return df[column]

df1 = pd.DataFrame({'no_of_trainings':[1,2,3,4,5,1000,6,7,8,9,10],
                    'other_data':[1,2,3,4,5,9,6,7,8,9,10]})

df1['no_of_trainings'] = find_outlier(df1, 'no_of_trainings')
print(df1)

Output:

Output

Note:

The outlier of 1000 was removed.

Answered By: ScottC