Outlier detection of time-series data
Question:
I have a pandas dataframe where I want to detect outliers on a single column. Please bear in mind that I am not experienced when it comes to data handling/cleaning.
The dataframe looks like this:
Time
MW
2019-01-01 00:00:00
1234.0
2019-01-01 01:00:00
1234.5
2019-01-01 02:00:00
1235.2
2019-01-01 03:00:00
1235.1
You can see on the figure below that I do have some outliers, most noticably the one right above the 08:00:00 mark.
What I want is to detect these outliers and, instead of removing them, I want to modify them to be the mean of the 10 values that come before it (So the value of row 11 would be changed to the mean value of rows 1-10).
I have looked into calculating the z-score and finding outliers based on that, but it seems to focus on the standard deviation of the total dataset, instead of only a local range. You can see the code that I wrote for this below.
def dfCleaning(df):
df.reset_index(drop=True, inplace=True)
z = np.abs(stats.zscore(df['MW']))
for x in np.where((z > 3) | (z < -3)):
for index in x:
df.loc[index, 'MW'] = df.iloc[max(0, index-11) : max(1,index-1)]['MW'].mean()
return df
So if there is a way to detect outliers based on a range of rows in a dataset (perhaps 50 rows above and below the row in focus), I would highly appreciate if someone could explain it to me.
Answers:
This is a sample for remove outlier function. But I recommend split your data into times for ex. day to day or week to week apply remove outlier function and then concatenate it.
def remove_outliers(df_in, col_name):
q1 = df_in[col_name].quantile(0.25)
q3 = df_in[col_name].quantile(0.75)
iqr = q3 -q1
fence_low = q1 -1.5 *iqr
fence_high = q3 +1.5 *iqr
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
return df_out
You can use:
def zscore(s, window, thresh=3, return_all=False):
roll = s.rolling(window=window, min_periods=1, center=True)
avg = roll.mean()
std = roll.std(ddof=0)
z = s.sub(avg).div(std)
m = z.between(-thresh, thresh)
if return_all:
return z, avg, std, m
return s.where(m, avg)
df['MW2'] = zscore(df['MW'], window=50)
full example:
N = 1000
np.random.seed(1)
df = pd.DataFrame({'MW': np.sin(np.linspace(0, 10, num=N))+np.random.normal(scale=0.6, size=N)})
z, avg, std, m = zscore(df['MW'], window=50, return_all=True)
ax = plt.subplot()
df['MW'].plot(label='data')
avg.plot(label='mean')
df.loc[~m, 'MW'].plot(label='outliers', marker='o', ls='')
avg[~m].plot(label='replacement', marker='o', ls='')
plt.legend()
Output:
With thresh=2
:
I have a pandas dataframe where I want to detect outliers on a single column. Please bear in mind that I am not experienced when it comes to data handling/cleaning.
The dataframe looks like this:
Time | MW |
---|---|
2019-01-01 00:00:00 | 1234.0 |
2019-01-01 01:00:00 | 1234.5 |
2019-01-01 02:00:00 | 1235.2 |
2019-01-01 03:00:00 | 1235.1 |
You can see on the figure below that I do have some outliers, most noticably the one right above the 08:00:00 mark.
What I want is to detect these outliers and, instead of removing them, I want to modify them to be the mean of the 10 values that come before it (So the value of row 11 would be changed to the mean value of rows 1-10).
I have looked into calculating the z-score and finding outliers based on that, but it seems to focus on the standard deviation of the total dataset, instead of only a local range. You can see the code that I wrote for this below.
def dfCleaning(df):
df.reset_index(drop=True, inplace=True)
z = np.abs(stats.zscore(df['MW']))
for x in np.where((z > 3) | (z < -3)):
for index in x:
df.loc[index, 'MW'] = df.iloc[max(0, index-11) : max(1,index-1)]['MW'].mean()
return df
So if there is a way to detect outliers based on a range of rows in a dataset (perhaps 50 rows above and below the row in focus), I would highly appreciate if someone could explain it to me.
This is a sample for remove outlier function. But I recommend split your data into times for ex. day to day or week to week apply remove outlier function and then concatenate it.
def remove_outliers(df_in, col_name):
q1 = df_in[col_name].quantile(0.25)
q3 = df_in[col_name].quantile(0.75)
iqr = q3 -q1
fence_low = q1 -1.5 *iqr
fence_high = q3 +1.5 *iqr
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
return df_out
You can use:
def zscore(s, window, thresh=3, return_all=False):
roll = s.rolling(window=window, min_periods=1, center=True)
avg = roll.mean()
std = roll.std(ddof=0)
z = s.sub(avg).div(std)
m = z.between(-thresh, thresh)
if return_all:
return z, avg, std, m
return s.where(m, avg)
df['MW2'] = zscore(df['MW'], window=50)
full example:
N = 1000
np.random.seed(1)
df = pd.DataFrame({'MW': np.sin(np.linspace(0, 10, num=N))+np.random.normal(scale=0.6, size=N)})
z, avg, std, m = zscore(df['MW'], window=50, return_all=True)
ax = plt.subplot()
df['MW'].plot(label='data')
avg.plot(label='mean')
df.loc[~m, 'MW'].plot(label='outliers', marker='o', ls='')
avg[~m].plot(label='replacement', marker='o', ls='')
plt.legend()
Output:
With thresh=2
: