Fill missing dates hourly per group with previous value in certain column using Pandas

Question

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 14:00:00      False      3
2     B  2022-09-01 13:00:00      False      1
3     B  2022-09-01 16:00:00       True      4

I would like to fill in the missing dates between dates hourly. So each hour that is missing between dates should be filled and the values should be the same as the previous data. Here is the desired output:

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2], 
        ['A', '2022-09-01 12:00:00', False, 2], ['A', '2022-09-01 13:00:00', False, 2], 
        ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1],
        ['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 11:00:00      False      2
2     A  2022-09-01 12:00:00      False      2
3     A  2022-09-01 13:00:00      False      2
4     A  2022-09-01 14:00:00      False      3
5     B  2022-09-01 13:00:00      False      1
6     B  2022-09-01 14:00:00      False      1
7     B  2022-09-01 15:00:00      False      1
8     B  2022-09-01 16:00:00       True      4

So I was wondering if it is possible to fill the missing dates hourly per group with the previous value in column value using Pandas?

Asked By: Quinten

||

Source

Answer 1

You can use:

df['date'] = pd.to_datetime(df['date'])

out = (df
   .groupby('group', as_index=False, group_keys=False)
   .apply(lambda g: g.set_index('date')
                     .reindex(pd.date_range(g['date'].min(),
                                            g['date'].max(),
                                            freq='H'))
                     .ffill(downcast='infer').reset_index()
         
         )
   .reset_index(drop=True)
)

output:

                index group  indicator  value
0 2022-09-01 10:00:00     A      False      2
1 2022-09-01 11:00:00     A      False      2
2 2022-09-01 12:00:00     A      False      2
3 2022-09-01 13:00:00     A      False      2
4 2022-09-01 14:00:00     A      False      3
5 2022-09-01 13:00:00     B      False      1
6 2022-09-01 14:00:00     B      False      1
7 2022-09-01 15:00:00     B      False      1
8 2022-09-01 16:00:00     B       True      4

Answered By: mozway

Answer 2

here is one other way about it

df['date']=pd.to_datetime(df['date'])

df2=(df.set_index('date' )
     .groupby('group',  group_keys=False)
     .apply(lambda x: x.resample('1H').ffill()) 
     .reset_index() )
df2

                   date     group   indicator   value
0   2022-09-01 10:00:00        A    False       2
1   2022-09-01 11:00:00        A    False       2
2   2022-09-01 12:00:00        A    False       2
3   2022-09-01 13:00:00        A    False       2
4   2022-09-01 14:00:00        A    False       3
5   2022-09-01 13:00:00        B    False       1
6   2022-09-01 14:00:00        B    False       1
7   2022-09-01 15:00:00        B    False       1
8   2022-09-01 16:00:00        B    True        4

Answered By: Naveed

Answer 3

One option is with complete from pyjanitor, to expose missing rows:

# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])

# build a dictionary to contain the new dates
# the key of the dictionary must exist in the dataframe
new_date = {'date': lambda date: pd.date_range(date.min(), date.max(), freq='H')}

df.complete(new_date, by = 'group').ffill(downcast='infer') 
  group                date  indicator  value
0     A 2022-09-01 10:00:00      False      2
1     A 2022-09-01 11:00:00      False      2
2     A 2022-09-01 12:00:00      False      2
3     A 2022-09-01 13:00:00      False      2
4     A 2022-09-01 14:00:00      False      3
5     B 2022-09-01 13:00:00      False      1
6     B 2022-09-01 14:00:00      False      1
7     B 2022-09-01 15:00:00      False      1
8     B 2022-09-01 16:00:00       True      4

Answered By: sammywemmy

Fill missing dates hourly per group with previous value in certain column using Pandas

Question:

Answers: