How to complete missing data in a DataFrame
Question:
I am using an API to download live stock market data.
This data is often incomplete.
For example:
Open High Low Close Adj Close Volume
Datetime
2022-02-16 15:00:00-05:00 172.872101 173.029999 172.839996 172.910004 172.910004 0
2022-02-16 15:01:00-05:00 172.899994 172.949997 172.779999 172.815002 172.815002 160249
2022-02-16 15:04:00-05:00 173.089996 173.320007 173.030106 173.315002 173.315002 311095
2022-02-16 15:05:00-05:00 173.320007 173.339996 173.164993 173.214996 173.214996 174639
2022-02-16 15:07:00-05:00 173.139999 173.179993 173.089996 173.160004 173.160004 135559
As the timestamps show, some minutes are skipped entirely (15:02, 15:03 and 15:06).
My question is:
Is there a way to fill in the missing rows to get something like this?
Open High Low Close Adj Close Volume
Datetime
2022-02-16 15:00:00-05:00 172.872101 173.029999 172.839996 172.910004 172.910004 0
2022-02-16 15:01:00-05:00 172.899994 172.949997 172.779999 172.815002 172.815002 160249
2022-02-16 15:02:00-05:00 172.809998 172.990005 172.809998 172.979996 172.979996 119117
2022-02-16 15:03:00-05:00 172.970001 173.169998 172.964996 173.080093 173.080093 264624
2022-02-16 15:04:00-05:00 173.089996 173.320007 173.030106 173.315002 173.315002 311095
2022-02-16 15:05:00-05:00 173.320007 173.339996 173.164993 173.214996 173.214996 174639
2022-02-16 15:06:00-05:00 173.220001 173.220001 173.080002 173.139999 173.139999 124707
2022-02-16 15:07:00-05:00 173.139999 173.179993 173.089996 173.160004 173.160004 135559
Answers:
There are many ways to handle missing data. This blog post walks through six imputation techniques with examples:
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
- Drop the missing data if you still have enough data left for training.
- Otherwise, fill in the missing values using one of the techniques in the blog post.
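Resample to 1-minute periods, which creates NaN rows for the skipped minutes, then interpolate to fill those NaN values. Note that the filled values are estimates, not the prices that actually traded.
# '1min' replaces the older alias '1T', which is deprecated in recent pandas versions
df = df.resample('1min').interpolate(method='linear', limit_direction='forward', axis=0)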
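To make the resample-then-interpolate step concrete, here is a minimal sketch using the Close and Volume values from the question's sample. The `interpolated` flag column is an addition for illustration and is not part of the original answer; it marks which rows were estimated. Keep in mind that linear interpolation cannot recover the real quotes shown in the question's desired output (for example 172.979996 at 15:02); it only draws a straight line between the neighbouring rows.

```python
import pandas as pd

# Timestamps and values copied from the question's sample (Close and Volume only).
idx = pd.to_datetime([
    "2022-02-16 15:00:00-05:00",
    "2022-02-16 15:01:00-05:00",
    "2022-02-16 15:04:00-05:00",
    "2022-02-16 15:05:00-05:00",
    "2022-02-16 15:07:00-05:00",
])
df = pd.DataFrame(
    {
        "Close": [172.910004, 172.815002, 173.315002, 173.214996, 173.160004],
        "Volume": [0, 160249, 311095, 174639, 135559],
    },
    index=idx,
)
df.index.name = "Datetime"

# Step 1: put the data on a regular 1-minute grid; skipped minutes become NaN rows.
regular = df.resample("1min").asfreq()

# Remember which rows were absent so estimates are never mistaken for real quotes
# (this flag is an illustrative addition, not part of the original answer).
was_missing = regular["Close"].isna()

# Step 2: fill the NaN rows by linear interpolation between neighbouring rows.
filled = regular.interpolate(method="linear")
filled["interpolated"] = was_missing

print(filled)
```

Whether interpolating `Volume` makes sense depends on the use case; leaving it as NaN or setting it to 0 for the estimated rows are equally defensible choices.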
How can I fill in the missing data with a simple arithmetic mean, taking the NaN values into account? The column to be completed is "VILLARTEAGA". Sorry, I'm new to this.
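from sklearn.impute import SimpleImputer
import numpy as np
# SimpleImputer expects a 2-D array, so select the column with a list
# to keep the shape (n_samples, 1) instead of a flat 1-D array
X = dfTDia.iloc[:, [2]].values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform computes the column mean (ignoring NaN) and fills the gaps with it
X = imputer.fit_transform(X)
X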
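For a single column, the same mean imputation can also be sketched in pandas alone, without scikit-learn. The values below are made up for illustration; only the DataFrame name `dfTDia` and the column name "VILLARTEAGA" come from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the numbers are invented, only the names come from the post.
dfTDia = pd.DataFrame({"VILLARTEAGA": [10.0, np.nan, 14.0, np.nan, 12.0]})

# Series.mean() skips NaN by default, so this is the mean of the observed values only.
col_mean = dfTDia["VILLARTEAGA"].mean()

# Replace every NaN in the column with that mean.
dfTDia["VILLARTEAGA"] = dfTDia["VILLARTEAGA"].fillna(col_mean)

print(dfTDia)
```

This fills every gap with one constant, which is simple but flattens any trend in the data; for time-ordered data, interpolation may be a better fit.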