Imputation: Why do we replace NaN values with the mean, and doesn't it affect our data?

Question:

Why do we replace NaN values in a DataFrame with the mean, and doesn’t changing them affect our data?

0     1.048242
1     1.688173 
2          NaN
3     0.194162
4     0.194162
5     0.493194
6          NaN
7     0.675041
8          NaN
9     0.101743
10    3.112086

mean = df['view_duration'].mean()
df['view_duration'] = df['view_duration'].fillna(mean)

0     1.048242
1     1.688173
2     0.938350
3     0.194162
4     0.194162
5     0.493194
6     0.938350
7     0.675041
8     0.938350
9     0.101743
10    3.112086


Asked By: mohamed kamal


Answers:

Replacing nulls with other relevant values (such as the mean) is called imputation. It is usually done because most machine learning models cannot accept nulls.

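It will not change the mean of the column, but it does change other statistics: the variance shrinks, because every filled-in value sits exactly at the mean.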
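
A minimal sketch of that effect, using the values from the question. The variable names `s` and `filled` are illustrative only, and the sample standard deviation is just one of several statistics that could be checked:

```python
import numpy as np
import pandas as pd

# The column from the question, with its three missing values.
s = pd.Series(
    [1.048242, 1.688173, np.nan, 0.194162, 0.194162, 0.493194,
     np.nan, 0.675041, np.nan, 0.101743, 3.112086],
    name="view_duration",
)

# Series.mean() skips NaN by default, so this is the mean of the 8 observed values.
filled = s.fillna(s.mean())

# The mean is unchanged by the imputation...
print("mean before:", s.mean(), "after:", filled.mean())
# ...but the spread is smaller, since the new points add no deviation.
print("std before: ", s.std(), "after:", filled.std())
```

The mean stays the same while the standard deviation drops, which is the sense in which imputation still affects the data.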

Please note that if a column has too many nulls (a common rule of thumb is above 30%, but this should be judged case by case), it is better to drop the rows with nulls than to impute them.
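
One way to apply that rule of thumb is sketched below. The 30% threshold comes from the answer above and is not a fixed standard; `MAX_MISSING` and `cleaned` are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "view_duration": [1.048242, 1.688173, np.nan, 0.194162, 0.194162, 0.493194,
                      np.nan, 0.675041, np.nan, 0.101743, 3.112086],
})

# Assumed rule-of-thumb threshold; adjust it for the problem at hand.
MAX_MISSING = 0.30

# isna() yields booleans, so their mean is the fraction of missing values.
missing_fraction = df["view_duration"].isna().mean()

if missing_fraction > MAX_MISSING:
    # Too many gaps to trust an imputed value: drop the affected rows.
    cleaned = df.dropna(subset=["view_duration"])
else:
    # Few enough gaps: fill them with the column mean.
    cleaned = df.assign(
        view_duration=df["view_duration"].fillna(df["view_duration"].mean())
    )

print(f"missing: {missing_fraction:.1%}, rows kept: {len(cleaned)}")
```

With 3 of 11 values missing (about 27%), this example falls just under the threshold and takes the imputation branch.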

Answered By: gtomer

It does affect the data.

The reason we do this is that many algorithms can’t operate on series containing NaNs. One particularly prominent example is the Fourier transform and the methods derived from it. With more "regular" operations, NaNs just propagate and a substantial part of the data may remain "clean"; time series analysis, by contrast, is dead if you have even one NaN in the middle of the data.
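
The difference between the two kinds of operation can be sketched with NumPy. The signal here is synthetic and the choice of `np.fft.fft` is an assumption made for illustration, not something the answer specifies:

```python
import numpy as np

# A short synthetic signal with a single missing sample in the middle.
signal = np.sin(2 * np.pi * np.arange(8) / 8)
signal[3] = np.nan

# An elementwise ("regular") operation: the NaN stays in its one position.
shifted = signal + 1.0

# A Fourier transform: every output bin depends on every input sample,
# so the single NaN contaminates the whole spectrum.
spectrum = np.fft.fft(signal)

nan_in_shifted = int(np.isnan(shifted).sum())
nan_in_spectrum = int(np.isnan(spectrum).sum())
print("NaNs after elementwise op:", nan_in_shifted, "of", shifted.size)
print("NaNs after FFT:           ", nan_in_spectrum, "of", spectrum.size)
```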

Replacing with the mean is usually the most sensible default, but not always. Again, time series analysis is the prominent case: if you are missing an entire period of observations in highly periodic data, replacing it with the mean distorts the end result much more than replacing it with a sensible approximation of the "average" period trend. (Usually this is inconsequential, though: if the filled-in period is large enough to affect the analysis, the analysis is likely bogus anyway.) So this is problem-specific, and it may take extreme care and domain knowledge to do it right.
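
As a rough illustration of the periodic case, the sketch below compares a plain mean fill with a fill based on the average value at the same phase of the cycle. Everything here is an assumption made for the example: the data is a noise-free sine wave, the period is known in advance, and "average period trend" is interpreted as the per-phase mean across the other cycles:

```python
import numpy as np
import pandas as pd

period = 24                      # assumed known cycle length
t = np.arange(period * 10)       # ten full cycles
truth = pd.Series(np.sin(2 * np.pi * t / period))

# Knock out half a cycle of observations.
observed = truth.copy()
observed.iloc[100:112] = np.nan
gap = observed.isna()

# Option 1: fill the gap with the overall mean of the observed values.
mean_filled = observed.fillna(observed.mean())

# Option 2: fill each missing point with the mean of the observed values
# that share its phase (position within the cycle).
phase = t % period
phase_mean = observed.groupby(phase).transform("mean")
phase_filled = observed.fillna(phase_mean)

# Worst-case error inside the gap for each strategy.
mean_error = (mean_filled - truth)[gap].abs().max()
phase_error = (phase_filled - truth)[gap].abs().max()
print("max error, mean fill: ", mean_error)
print("max error, phase fill:", phase_error)
```

On real data with noise, drift, or an uncertain period, the per-phase fill would be far less exact than it is here, which is where the domain knowledge comes in.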

Answered By: Lodinn