Fill missing dates hourly per group with previous value in certain column using Pandas
Question:
I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 14:00:00', False, 3],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 14:00:00 False 3
2 B 2022-09-01 13:00:00 False 1
3 B 2022-09-01 16:00:00 True 4
I would like to fill in the missing dates between dates hourly. So each hour that is missing between dates should be filled and the values should be the same as the previous data. Here is the desired output:
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2],
['A', '2022-09-01 12:00:00', False, 2], ['A', '2022-09-01 13:00:00', False, 2],
['A', '2022-09-01 14:00:00', False, 3],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1],
['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 11:00:00 False 2
2 A 2022-09-01 12:00:00 False 2
3 A 2022-09-01 13:00:00 False 2
4 A 2022-09-01 14:00:00 False 3
5 B 2022-09-01 13:00:00 False 1
6 B 2022-09-01 14:00:00 False 1
7 B 2022-09-01 15:00:00 False 1
8 B 2022-09-01 16:00:00 True 4
So I was wondering if it is possible to fill the missing dates hourly per group with the previous value in column value using Pandas
?
Answers:
You can use:
df['date'] = pd.to_datetime(df['date'])
out = (df
.groupby('group', as_index=False, group_keys=False)
.apply(lambda g: g.set_index('date')
.reindex(pd.date_range(g['date'].min(),
g['date'].max(),
freq='H'))
.ffill(downcast='infer').reset_index()
)
.reset_index(drop=True)
)
output:
index group indicator value
0 2022-09-01 10:00:00 A False 2
1 2022-09-01 11:00:00 A False 2
2 2022-09-01 12:00:00 A False 2
3 2022-09-01 13:00:00 A False 2
4 2022-09-01 14:00:00 A False 3
5 2022-09-01 13:00:00 B False 1
6 2022-09-01 14:00:00 B False 1
7 2022-09-01 15:00:00 B False 1
8 2022-09-01 16:00:00 B True 4
here is one other way about it
df['date']=pd.to_datetime(df['date'])
df2=(df.set_index('date' )
.groupby('group', group_keys=False)
.apply(lambda x: x.resample('1H').ffill())
.reset_index() )
df2
date group indicator value
0 2022-09-01 10:00:00 A False 2
1 2022-09-01 11:00:00 A False 2
2 2022-09-01 12:00:00 A False 2
3 2022-09-01 13:00:00 A False 2
4 2022-09-01 14:00:00 A False 3
5 2022-09-01 13:00:00 B False 1
6 2022-09-01 14:00:00 B False 1
7 2022-09-01 15:00:00 B False 1
8 2022-09-01 16:00:00 B True 4
One option is with complete from pyjanitor, to expose missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])
# build a dictionary to contain the new dates
# the key of the dictionary must exist in the dataframe
new_date = {'date': lambda date: pd.date_range(date.min(), date.max(), freq='H')}
df.complete(new_date, by = 'group').ffill(downcast='infer')
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 11:00:00 False 2
2 A 2022-09-01 12:00:00 False 2
3 A 2022-09-01 13:00:00 False 2
4 A 2022-09-01 14:00:00 False 3
5 B 2022-09-01 13:00:00 False 1
6 B 2022-09-01 14:00:00 False 1
7 B 2022-09-01 15:00:00 False 1
8 B 2022-09-01 16:00:00 True 4
I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 14:00:00', False, 3],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 14:00:00 False 3
2 B 2022-09-01 13:00:00 False 1
3 B 2022-09-01 16:00:00 True 4
I would like to fill in the missing dates between dates hourly. So each hour that is missing between dates should be filled and the values should be the same as the previous data. Here is the desired output:
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2],
['A', '2022-09-01 12:00:00', False, 2], ['A', '2022-09-01 13:00:00', False, 2],
['A', '2022-09-01 14:00:00', False, 3],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1],
['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 11:00:00 False 2
2 A 2022-09-01 12:00:00 False 2
3 A 2022-09-01 13:00:00 False 2
4 A 2022-09-01 14:00:00 False 3
5 B 2022-09-01 13:00:00 False 1
6 B 2022-09-01 14:00:00 False 1
7 B 2022-09-01 15:00:00 False 1
8 B 2022-09-01 16:00:00 True 4
So I was wondering if it is possible to fill the missing dates hourly per group with the previous value in column value using Pandas
?
You can use:
df['date'] = pd.to_datetime(df['date'])
out = (df
.groupby('group', as_index=False, group_keys=False)
.apply(lambda g: g.set_index('date')
.reindex(pd.date_range(g['date'].min(),
g['date'].max(),
freq='H'))
.ffill(downcast='infer').reset_index()
)
.reset_index(drop=True)
)
output:
index group indicator value
0 2022-09-01 10:00:00 A False 2
1 2022-09-01 11:00:00 A False 2
2 2022-09-01 12:00:00 A False 2
3 2022-09-01 13:00:00 A False 2
4 2022-09-01 14:00:00 A False 3
5 2022-09-01 13:00:00 B False 1
6 2022-09-01 14:00:00 B False 1
7 2022-09-01 15:00:00 B False 1
8 2022-09-01 16:00:00 B True 4
here is one other way about it
df['date']=pd.to_datetime(df['date'])
df2=(df.set_index('date' )
.groupby('group', group_keys=False)
.apply(lambda x: x.resample('1H').ffill())
.reset_index() )
df2
date group indicator value
0 2022-09-01 10:00:00 A False 2
1 2022-09-01 11:00:00 A False 2
2 2022-09-01 12:00:00 A False 2
3 2022-09-01 13:00:00 A False 2
4 2022-09-01 14:00:00 A False 3
5 2022-09-01 13:00:00 B False 1
6 2022-09-01 14:00:00 B False 1
7 2022-09-01 15:00:00 B False 1
8 2022-09-01 16:00:00 B True 4
One option is with complete from pyjanitor, to expose missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])
# build a dictionary to contain the new dates
# the key of the dictionary must exist in the dataframe
new_date = {'date': lambda date: pd.date_range(date.min(), date.max(), freq='H')}
df.complete(new_date, by = 'group').ffill(downcast='infer')
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 11:00:00 False 2
2 A 2022-09-01 12:00:00 False 2
3 A 2022-09-01 13:00:00 False 2
4 A 2022-09-01 14:00:00 False 3
5 B 2022-09-01 13:00:00 False 1
6 B 2022-09-01 14:00:00 False 1
7 B 2022-09-01 15:00:00 False 1
8 B 2022-09-01 16:00:00 True 4