How to detect outliers in a timeseries dataframe and write the "clean" ones in a new dataframe

Question:

I’m really new to Python (and programming in general, hihi) and I’m analyzing two years of meteorological data measured every 10 s. In total I have 12 meteorological parameters, and I’ve created my dataframe df with the time as my row index and the names of the meteorological parameters as the column names. Since I don’t need super-fine granularity, I’ve resampled the data to hourly values, so the dataframe looks something like this:

Time                 G_DIFF  G_HOR     G_INCL    RAIN      RH     T_a       V_a       V_a_dir
2016-05-01 02:00:00  0.0     0.011111  0.000000  0.013333  100.0  9.128167  1.038944  175.378056
2016-05-01 03:00:00  0.0     0.200000  0.016667  0.020000  100.0  8.745833  1.636944  218.617500
2016-05-01 04:00:00  0.0     0.105556  0.013889  0.010000  100.0  8.295333  0.931000  232.873333

There are outliers, and I can get rid of them with a rolling mean and standard deviation, which is what I’ve done "by hand" with the following code for one of the columns (the ambient temperature). The algorithm writes the clean data into another dataframe (tr in the example below).

roll = df["T_a"].rolling(24, center=True)  # 24 h window
mean, std = roll.mean(), roll.std()
cut = std * 3
low, up = mean - cut, mean + cut
tr.loc[(df["T_a"] < low) | (df["T_a"] > up) | (df["T_a"].isna()), "outliers"] = df["T_a"]
tr.loc[(df["T_a"] >= low) & (df["T_a"] <= up), "T_a"] = df["T_a"]
tr.loc[tr["T_a"].isna(), "T_a"] = tr["T_a"].bfill()  # fill in a value when a datum is NaN
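For reference, here is a minimal, self-contained sketch of this rolling 3-sigma filter on synthetic data (the series name and the injected spike are purely illustrative, not the actual measurements):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with one artificial spike
idx = pd.date_range("2016-05-01", periods=200, freq="h")
s = pd.Series(10 + np.sin(np.arange(200) / 10), index=idx, name="T_a")
s.iloc[100] = 50.0  # inject an obvious outlier

roll = s.rolling(24, center=True)  # 24 h window, as above
mean, std = roll.mean(), roll.std()
low, up = mean - 3 * std, mean + 3 * std

clean = s.where((s >= low) & (s <= up))  # values outside the band become NaN
clean = clean.bfill()                    # back-fill the gaps, as in the question
```

Note that the first and last half-window of values get NaN bounds (the rolling window is incomplete there), so they are also dropped; `min_periods` can relax that if needed.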

Now, as I said, that works okay for one column, BUT I would like to be able to do it for all 12 columns, and I’m almost sure there’s a more Pythonic way to do it. I guess it should be feasible with a for loop, but nothing I’ve tried so far is working.

Could anyone give me some light, please? Thank you so much!!

Asked By: C.S. Seefoo


Answers:

There are two ways to remove outliers from time-series data: one is calculating percentiles and the mean/standard deviation, which I think is what you are using; the other is looking at graphs, because sometimes the spread of the data gives more information visually.

I have worked on data for yellow-taxi prediction in a certain area; basically, I had a model that could predict in which region of NYC a taxi could get more customers.

In that project I had time-series data with 10-second intervals and various features like trip distance, speed, and working hours, and one of them was "Total fare". I also wanted to remove the outliers from each column, so I started using the mean and percentiles to do so.

The thing with total fares was that the mean and percentiles were not giving an accurate threshold.

[Box plot for total fares]

and my percentile values:

0 percentile value is -242.55
10 percentile value is 6.3
20 percentile value is 7.8
30 percentile value is 8.8
40 percentile value is 9.8
50 percentile value is 11.16
60 percentile value is 12.8
70 percentile value is 14.8
80 percentile value is 18.3
90 percentile value is 25.8
100 percentile value is 3950611.6
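A percentile table like the one above can be produced with `numpy.percentile`. A sketch on synthetic fares (the distribution and the two appended extremes are made up to mimic the shape of the list above):

```python
import numpy as np

# Synthetic "fare" data with two extreme values, just to illustrate the check
rng = np.random.default_rng(0)
fares = rng.normal(12, 4, size=10_000)
fares = np.append(fares, [3950611.6, -242.55])  # extremes like the ones above

for p in range(0, 101, 10):
    print(f"{p} percentile value is {np.percentile(fares, p):.2f}")
```

The 0th and 100th percentiles are just the min and max, so a single extreme value dominates them while the middle percentiles barely move.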

As you can see, a fare of 100 would have been perfectly OK but was considered an outlier.

So I basically turned to visualization.

I sorted my fare values and plotted them:

[Plot: sorted fare values on the y-axis]

As you can see, there is a little steepness at the end,

so I basically magnified it.

Something like this:

[Plot: sorted fares, magnified for the last three percentiles]

And then I magnified it more, from the 50th to the second-to-last percentile:

[Plot: 50th to second-to-last percentile]

And voilà, I got my threshold, i.e. 1000.

In formal terms this is called the "elbow method". What you are doing is the first step, and if you are not happy with it, this can be the second step to find those thresholds.
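The sort-and-look-for-the-jump idea can also be done numerically: sort the values, look only at the top percentiles, and take the largest gap between consecutive sorted values as the elbow. A sketch under entirely synthetic data (the normal/uniform mixture below just stands in for "normal fares plus a handful of extreme ones"):

```python
import numpy as np

# 9,990 plausible values plus 10 extreme ones
rng = np.random.default_rng(1)
fares = np.concatenate([rng.normal(12, 4, 9_990), rng.uniform(3_000, 3_500, 10)])

sorted_fares = np.sort(fares)
# "Magnify" the tail: look only at the last percentile instead of the whole range
tail = sorted_fares[int(0.99 * len(sorted_fares)):]
# The biggest jump between consecutive sorted values marks the elbow
jumps = np.diff(tail)
elbow = tail[jumps.argmax()]  # last "normal" value before the jump
print("threshold candidate:", elbow)
```

This only automates the eyeballing; on real data the jump can be less clean, which is why looking at the plot first is still worthwhile.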

I suggest you go column by column and use any of these techniques, because that way you know how much data you are losing in each one, and losing data is losing information.

Personally, I rely on visualization; in the end, it really depends on the data.

Answered By: Raghav Agarwal
all_columns = df.columns.tolist()  # list of all column names
all_columns.remove('G_DIFF')       # drop any column you don't want to clean (list.remove works in place)

for column in all_columns:
    roll = df[column].rolling(24, center=True)  # 24 h window
    mean, std = roll.mean(), roll.std()
    cut = std * 3
    low, up = mean - cut, mean + cut
    tr.loc[(df[column] < low) | (df[column] > up) | (df[column].isna()), "outliers"] = df[column]
    tr.loc[(df[column] >= low) & (df[column] <= up), column] = df[column]
    tr.loc[tr[column].isna(), column] = tr[column].bfill()  # fill in a value when a datum is NaN
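Since `rolling`, the comparisons, and `bfill` all operate column-wise on a whole DataFrame, the loop can also be collapsed into a few vectorized lines. A minimal sketch with illustrative column names and synthetic data (not the actual measurements):

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with two columns and injected spikes
idx = pd.date_range("2016-05-01", periods=200, freq="h")
df = pd.DataFrame({
    "T_a": 10 + np.sin(np.arange(200) / 10),
    "RH": 80 + np.cos(np.arange(200) / 10),
}, index=idx)
df.iloc[100, 0] = 500.0   # outlier in T_a
df.iloc[120, 1] = -300.0  # outlier in RH

roll = df.rolling(24, center=True)  # rolling works on all columns at once
mean, std = roll.mean(), roll.std()
low, up = mean - 3 * std, mean + 3 * std

# NaN out everything outside the band, then back-fill the gaps
tr = df.where((df >= low) & (df <= up)).bfill()
```

This produces the cleaned columns in one pass; it does not build the separate "outliers" column from the loop version, so keep the loop if you need that bookkeeping.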
    
Answered By: ASLAN