How to detect outliers in a timeseries dataframe and write the "clean" ones in a new dataframe
Question:
I’m really new to Python (and programming in general, hihi) and I’m analyzing 2 years of meteorological data measured every 10 s. In total I have 12 meteorological parameters, and I’ve created my dataframe df
with the time as the row index and the names of the meteorological parameters as the column names. Since I don’t need such fine granularity, I’ve resampled the data to hourly values, so the dataframe looks something like this.
Time G_DIFF G_HOR G_INCL RAIN RH T_a V_a V_a_dir
2016-05-01 02:00:00 0.0 0.011111 0.000000 0.013333 100.0 9.128167 1.038944 175.378056
2016-05-01 03:00:00 0.0 0.200000 0.016667 0.020000 100.0 8.745833 1.636944 218.617500
2016-05-01 04:00:00 0.0 0.105556 0.013889 0.010000 100.0 8.295333 0.931000 232.873333
There are outliers, and I can get rid of them with a rolling mean and standard deviation, which is what I’ve done "by hand" with the following code for one of the columns (the ambient temperature). The code writes the clean data into another dataframe (tr in the example below).
roll = df["T_a"].rolling(24, center=True)  # 24 h window
mean, std = roll.mean(), roll.std()
cut = std * 3
low, up = mean - cut, mean + cut
tr.loc[(df["T_a"] < low) | (df["T_a"] > up) | (df["T_a"].isna()), "outliers"] = df["T_a"]
tr.loc[(df["T_a"] >= low) & (df["T_a"] <= up), "T_a"] = df["T_a"]
tr.loc[tr["T_a"].isna(), "T_a"] = tr["T_a"].bfill()  # fill in a value when a datum is NaN
Now, as I said, that works fine for one column, BUT I would like to do it for all 12 columns, and I’m almost sure there’s a more Pythonic way to do it. I guess a for loop should work, but nothing I’ve tried so far does.
Could anyone shed some light, please? Thank you so much!!
Answers:
There are two ways to remove outliers from time-series data: one is to compute percentiles or the mean and standard deviation (which I think is what you are using); the other is to look at graphs, because sometimes the spread of the data gives more information visually.
I worked on yellow-taxi demand prediction: a model that predicts in which region of NYC a taxi can find the most customers.
There I had time-series data at 10-second intervals with various features like trip distance, speed, and working hours, and one of them was "total fare". I also wanted to remove the outliers from each column, so I started using the mean and percentiles to do so.
The thing with total fares was that the mean and percentiles were not giving an accurate threshold.
These were my percentile values:
0th percentile value is -242.55
10th percentile value is 6.3
20th percentile value is 7.8
30th percentile value is 8.8
40th percentile value is 9.8
50th percentile value is 11.16
60th percentile value is 12.8
70th percentile value is 14.8
80th percentile value is 18.3
90th percentile value is 25.8
100th percentile value is 3950611.6
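For what it’s worth, a percentile sweep like the one above is a few lines with NumPy. The data below is synthetic, just to reproduce the shape of the problem (a handful of corrupt records distorting both extremes):

```python
import numpy as np

rng = np.random.default_rng(1)
fares = rng.uniform(5, 30, 10_000)              # plausible fares
fares = np.append(fares, [3950611.6, -242.55])  # a couple of corrupt records

# sweep the percentiles in steps of 10, as in the table above
for p in range(0, 101, 10):
    print(f"{p}th percentile value is {np.percentile(fares, p):.2f}")
```

Note how only the 0th and 100th percentiles betray the corruption, while everything from the 10th to the 90th looks perfectly normal.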
As you can see, a fare of 100 would be perfectly OK, yet a percentile cut would have flagged it as an outlier.
So I turned to visualization: I sorted my fare values and plotted them, and near the very end of the curve there is a small steep section.
So I magnified that region, and then magnified it again from the 50th to the second-to-last percentile,
and voilà, I had my threshold, i.e. 1000.
This technique is commonly called the "elbow method". What you are doing is the first step, and if you are not happy with it, this can be the second step to find those thresholds.
I suggest you go column by column with either of these techniques, because that way you know how much data you are losing, and losing data is losing information.
Personally, I favor visualization, but in the end it really depends on the data.
all_columns = df.columns.tolist()  # list of all column names
all_columns.remove('G_DIFF')       # drop a column you don't want to filter
                                   # (list.remove mutates in place and returns None,
                                   #  so don't assign its result back)
for column in all_columns:
    roll = df[column].rolling(24, center=True)  # 24 h window
    mean, std = roll.mean(), roll.std()
    cut = std * 3
    low, up = mean - cut, mean + cut
    tr.loc[(df[column] < low) | (df[column] > up) | (df[column].isna()), "outliers"] = df[column]
    tr.loc[(df[column] >= low) & (df[column] <= up), column] = df[column]
    tr.loc[tr[column].isna(), column] = tr[column].bfill()  # fill in a value when a datum is NaN
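Alternatively, since pandas rolling and comparison operations broadcast across all columns at once, the per-column loop can be collapsed into a vectorized sketch. The data here is synthetic, and `min_periods=1` is my own assumption to keep the window edges defined:

```python
import numpy as np
import pandas as pd

# synthetic hourly data with one injected spike (illustration only)
rng = np.random.default_rng(0)
idx = pd.date_range("2016-05-01", periods=200, freq="h")
df = pd.DataFrame({"T_a": rng.normal(10, 1, 200),
                   "RH": rng.normal(80, 5, 200)}, index=idx)
df.iloc[50, 0] = 100.0  # obvious outlier in the temperature column

roll = df.rolling(24, center=True, min_periods=1)  # 24 h window
low = roll.mean() - 3 * roll.std()
up = roll.mean() + 3 * roll.std()

# keep in-band values, mask the rest as NaN, then backfill the gaps
tr = df.where((df >= low) & (df <= up)).bfill()
```

`DataFrame.where` keeps values where the condition holds and sets the rest to NaN, so one expression handles every column; the `bfill` at the end plays the same role as the per-column backfill in the loop above.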