How to detect outliers in a dataframe (e.g. 90th percentile) for EVERY column

Question:

My dataframe can be simplified like this:

Dataframe :

df = pd.DataFrame({'Customer_ID': range(1, 9),  'Col 1': [32, 8, 21, 8, 25, 28, 26, 32], 'Col 2': [1, 3, 4, 22, 25, 42, 1, 33],
'Col 3' : [10, 1, 8, 6, 5, 2, 7, 3]})

{'Customer_ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
 'Col 1': {0: 32, 1: 8, 2: 21, 3: 8, 4: 25, 5: 28, 6: 26, 7: 32},
 'Col 2': {0: 1, 1: 3, 2: 4, 3: 22, 4: 25, 5: 42, 6: 1, 7: 33},
 'Col 3': {0: 10, 1: 1, 2: 8, 3: 6, 4: 5, 5: 2, 6: 7, 7: 3}}

How can I check this dataset for outliers based on the 90th percentile of each column, and create a resulting description like this:

df = pd.DataFrame({'Customer_ID': range(1, 9),  'Col 1': [32, 8, 21, 8, 25, 28, 26, 32], 'Col 2': [1, 3, 4, 22, 25, 42, 1, 33],
'Col 3' : [10, 1, 8, 6, 5, 2, 7, 3], 'Description': ['Col 1 & Col 3 = outliers', '-', '-', '-', '-', 'Col 2 = Outlier', '-', 'Col 1 = Outlier']})

desired output

I know that I can compute the q-th quantile for each column with:
df[['Col 1','Col 2','Col 3' ]].quantile(.90)

Asked By: Sina F


Answers:

You can use the describe() function and change its percentiles option.

# Create dataframe
data = pd.DataFrame([[1,2,3],[4,5,6], [7,8,9]])

# Default percentiles for the describe function (these two calls are equivalent)
data.describe(percentiles=[.25, .5, .75])
data.describe()

# Change percentiles values - Add what you want
data.describe(percentiles=[0.1, .5, 0.9])
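For reference, the 90% row of that summary can be read out by its label rather than by position. The small sketch below assumes the same toy dataframe as above; the variable names are only illustrative. It also shows that the row matches what quantile(0.9) returns.

```python
import pandas as pd

# Same toy dataframe as in the snippet above
data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# describe() labels each percentile row as a string such as '10%', '50%', '90%'
summary = data.describe(percentiles=[0.1, 0.5, 0.9])
p90_from_describe = summary.loc['90%']

# quantile() computes the same per-column threshold directly
p90_from_quantile = data.quantile(0.9)

print(p90_from_describe)
print(p90_from_quantile)
```

Looking the row up by the '90%' label avoids depending on where that row happens to sit in the describe() output.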

Applied to your data, in a NON-elegant way ( 😀 ):

df = pd.DataFrame({'Customer_ID': range(1, 9),  'Col 1': [32, 8, 21, 8, 25, 28, 26, 32], 'Col 2': [1, 3, 4, 22, 25, 42, 1, 33],
'Col 3' : [10, 1, 8, 6, 5, 2, 7, 3]})

# Flag each value at or above its column's 90th percentile (1 = outlier, 0 = not)
df['Outlier_1'] = (df['Col 1'] >= df['Col 1'].describe(percentiles=[0.9])['90%']).astype(int)
df['Outlier_2'] = (df['Col 2'] >= df['Col 2'].describe(percentiles=[0.9])['90%']).astype(int)
df['Outlier_3'] = (df['Col 3'] >= df['Col 3'].describe(percentiles=[0.9])['90%']).astype(int)

# Replace the flags with labels (empty string where the value is not an outlier)
df['Outlier_1'] = df['Outlier_1'].map({1: 'Outlier_1 ', 0: ''})
df['Outlier_2'] = df['Outlier_2'].map({1: 'Outlier_2 ', 0: ''})
df['Outlier_3'] = df['Outlier_3'].map({1: 'Outlier_3 ', 0: ''})

# Concatenate the three label columns into one
df['Outlier'] = (df['Outlier_1'] + df['Outlier_2'] + df['Outlier_3']).str.strip()

You can then drop or keep the intermediate columns. You can probably simplify this by wrapping the idea in a function and looping over the desired columns.
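One possible way to wrap this in a function is sketched below. It is only an illustration: the function name describe_outliers, its parameters, and the exact wording of the labels are assumptions, not part of the original answer. It uses quantile() directly instead of describe(), and treats a value as an outlier when it is at or above its column's 90th percentile.

```python
import pandas as pd

def describe_outliers(frame, columns, q=0.9):
    """Name, for each row, the columns whose value is at or above
    that column's q-th quantile ('-' when there are none).
    Hypothetical helper, written for illustration only."""
    thresholds = frame[columns].quantile(q)         # one threshold per column
    flags = frame[columns].ge(thresholds, axis=1)   # boolean mask, same shape

    def label(row):
        names = [col for col in columns if row[col]]
        if not names:
            return '-'
        suffix = 'Outlier' if len(names) == 1 else 'Outliers'
        return ' & '.join(names) + ' = ' + suffix

    return flags.apply(label, axis=1)

df = pd.DataFrame({'Customer_ID': range(1, 9),
                   'Col 1': [32, 8, 21, 8, 25, 28, 26, 32],
                   'Col 2': [1, 3, 4, 22, 25, 42, 1, 33],
                   'Col 3': [10, 1, 8, 6, 5, 2, 7, 3]})

df['Description'] = describe_outliers(df, ['Col 1', 'Col 2', 'Col 3'])
print(df)
```

Because the thresholds are computed once for all columns, adding another column to check only means adding its name to the list.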

Hope it helps.

Answered By: bvittrant