Outlier detection of time-series data

Question:

I have a pandas dataframe where I want to detect outliers on a single column. Please bear in mind that I am not experienced when it comes to data handling/cleaning.

The dataframe looks like this:

Time MW
2019-01-01 00:00:00 1234.0
2019-01-01 01:00:00 1234.5
2019-01-01 02:00:00 1235.2
2019-01-01 03:00:00 1235.1

You can see in the figure below that I do have some outliers, most noticeably the one right above the 08:00:00 mark.

[Figure: plot of the MW column over time]

What I want is to detect these outliers and, instead of removing them, replace each one with the mean of the 10 values that come before it (so the value of row 11 would be changed to the mean of rows 1-10).

I have looked into calculating the z-score and finding outliers based on that, but it seems to focus on the standard deviation of the total dataset, instead of only a local range. You can see the code that I wrote for this below.

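import numpy as np
from scipy import stats

def dfCleaning(df):
    df.reset_index(drop=True, inplace=True)
    # z-score over the whole column (global mean and standard deviation)
    z = np.abs(stats.zscore(df['MW']))

    # np.abs already folds negative scores, so one comparison is enough
    for index in np.where(z > 3)[0]:
        # mean of the (up to) 10 rows before `index`; row 0 falls back to itself
        df.loc[index, 'MW'] = df['MW'].iloc[max(0, index - 10):max(1, index)].mean()

    return df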

So if there is a way to detect outliers based on a range of rows in a dataset (perhaps 50 rows above and below the row in focus), I would highly appreciate it if someone could explain it to me.

Asked By: Jonas Svare

||

Answers:

This is a sample outlier-removal function based on the interquartile range. I recommend splitting your data into time periods (for example, day by day or week by week), applying the function to each period, and then concatenating the results.

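def remove_outliers(df_in, col_name):
    # first and third quartiles of the column
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # values beyond 1.5 * IQR from the quartiles count as outliers
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # keep only the rows strictly inside the fences
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out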
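
The split, apply, and concatenate steps described above could look roughly like the sketch below. It assumes the frame has a DatetimeIndex; the hourly sample data and the injected spike are made up for illustration, and the function is repeated only so the snippet runs on its own. Note that this approach drops outlier rows rather than replacing them.

```python
import numpy as np
import pandas as pd


def remove_outliers(df_in, col_name):
    """Drop rows whose value lies outside the 1.5 * IQR fences."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    keep = (df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)
    return df_in.loc[keep]


# Hypothetical hourly data shaped like the question's frame, with one spike.
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"MW": 1234.0 + rng.normal(scale=0.5, size=72)}, index=idx)
df.index.name = "Time"
df.iloc[30, 0] = 2000.0  # injected outlier on the second day

# Split by calendar day, filter each day on its own, then stitch back together.
daily = [remove_outliers(day, "MW") for _, day in df.groupby(df.index.normalize())]
cleaned = pd.concat(daily)

print(len(df), len(cleaned))
```

Grouping by `df.index.to_period("W")` instead of `df.index.normalize()` should give week-by-week chunks in the same way.
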
Answered By: Elkhan

You can use a rolling z-score, computed over a window centered on each row:

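def zscore(s, window, thresh=3, return_all=False):
    # rolling statistics over a window centered on each row
    roll = s.rolling(window=window, min_periods=1, center=True)
    avg = roll.mean()
    std = roll.std(ddof=0)
    # local z-score: distance from the rolling mean in rolling standard deviations
    z = s.sub(avg).div(std)
    # True where the value lies within the threshold (i.e. is not an outlier)
    m = z.between(-thresh, thresh)

    if return_all:
        return z, avg, std, m
    # keep inliers, replace outliers with the rolling mean
    return s.where(m, avg)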

df['MW2'] = zscore(df['MW'], window=50)
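
The function above replaces each outlier with the centered rolling mean. If the replacement should instead be the mean of the 10 preceding values, as the question asks, a variant along these lines might work. This is only a sketch: the function name `replace_with_trailing_mean` and its parameters are hypothetical, and the sample series is made up.

```python
import numpy as np
import pandas as pd


def replace_with_trailing_mean(s, detect_window=50, mean_window=10, thresh=3):
    """Flag points by a centered rolling z-score, then replace each flagged
    point with the mean of the `mean_window` values that come before it."""
    roll = s.rolling(window=detect_window, min_periods=1, center=True)
    z = (s - roll.mean()) / roll.std(ddof=0)
    ok = z.between(-thresh, thresh)

    # shift(1) excludes the current row, so the window covers only earlier rows.
    prev_mean = s.shift(1).rolling(window=mean_window, min_periods=1).mean()
    # The first row has no predecessors; fall back to its own value there.
    return s.where(ok, prev_mean.fillna(s))


# Made-up series: a repeating 0..4 pattern with one spike at position 50.
s = pd.Series(np.arange(100, dtype=float) % 5)
s.iloc[50] = 1000.0

cleaned = replace_with_trailing_mean(s)
print(cleaned.iloc[48:53])
```

The trailing mean is computed from the original values, so an earlier outlier that falls inside the 10-row window still influences the replacement.
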

Full example:

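import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic data: a sine wave plus Gaussian noise
N = 1000
np.random.seed(1)
df = pd.DataFrame({'MW': np.sin(np.linspace(0, 10, num=N))+np.random.normal(scale=0.6, size=N)})

z, avg, std, m = zscore(df['MW'], window=50, return_all=True)

ax = plt.subplot()

df['MW'].plot(ax=ax, label='data')
avg.plot(ax=ax, label='mean')
# ~m selects the rows flagged as outliers
df.loc[~m, 'MW'].plot(ax=ax, label='outliers', marker='o', ls='')
avg[~m].plot(ax=ax, label='replacement', marker='o', ls='')
ax.legend()
plt.show()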

Output:

[Figure: data, rolling mean, flagged outliers, and their replacements]

With thresh=2:

[Figure: the same plot with thresh=2]
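
To get a feel for how `thresh` changes the number of replaced points, one can count the flagged rows for a few thresholds. The sketch below reuses the synthetic data from the full example; the helper name `count_flagged` is hypothetical.

```python
import numpy as np
import pandas as pd


def count_flagged(s, window, thresh):
    """Count points whose centered rolling z-score exceeds `thresh` in magnitude."""
    roll = s.rolling(window=window, min_periods=1, center=True)
    z = (s - roll.mean()) / roll.std(ddof=0)
    return int((z.abs() > thresh).sum())


N = 1000
np.random.seed(1)
s = pd.Series(np.sin(np.linspace(0, 10, num=N)) + np.random.normal(scale=0.6, size=N))

# A lower threshold flags at least as many points as a higher one.
for thresh in (2, 2.5, 3):
    print(thresh, count_flagged(s, window=50, thresh=thresh))
```
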

Answered By: mozway