Extending a dataframe, filling in missing time, and keeping the other column values with the corresponding time?

Question:

More about my problem: I have a two-column dataframe (one information column and one time column) that is ~190k rows long. Some dates are missing, and I would like to fill in the missing dates while keeping each information value with its correct date, and then come back and fill in the missing information by resampling and interpolating.

What I’ve tried so far:

data_res = pd.period_range(min(data.time), max(data.time), freq = 'H')
data_res.reindex(data_res)

which was great, because it gave me the output and the correct times I needed.

(PeriodIndex(['1984-10-25 09:00', '1984-10-25 10:00', '1984-10-25 11:00',
          '1984-10-25 12:00', '1984-10-25 13:00', '1984-10-25 14:00',
          '1984-10-25 15:00', '1984-10-25 16:00', '1984-10-25 17:00',
          '1984-10-25 18:00',
          ...
          '2022-08-16 09:00', '2022-08-16 10:00', '2022-08-16 11:00',
          '2022-08-16 12:00', '2022-08-16 13:00', '2022-08-16 14:00',
          '2022-08-16 15:00', '2022-08-16 16:00', '2022-08-16 17:00',
          '2022-08-16 18:00'],
         dtype='period[H]', length=211460),

Then I used:

data1['time'] = pd.Series(data_res)

which had no issues. Finally I printed the table to double check and the output was:

189741 rows × 2 columns

Instead of running from the late 1980s to 2022 as it did when I imported it, the data now cuts off at 2006. I understand that the dataframe gets cut off where the length of the time column matches the length of the information column. My problem seems to be twofold: insert the full time range, and keep the information column values with their corresponding dates. I tried looking for similar problems, but everything I found was about inserting a shorter column into a dataframe and filling the extra values with NA, which is similar to what I need but did not point me in the right direction. Does anyone have any ideas on how to fix this? I would also be happy to provide extra information if needed.

Asked By: csingleton19


Answers:

Try a merge of 2 dfs instead of pd.Series.
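
For context, the cut-off at 2006 in the question is most likely index alignment: assigning a Series to a column matches rows by index label, so only as many timestamps as the dataframe has rows are kept, and the existing info values stay where they were. A minimal sketch of that effect, assuming data1 has a default 0..n-1 index; the sample values are made up, and pd.date_range with a Timedelta step is used here in place of pd.period_range:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: 3 observations, hours 02:00 and 03:00 missing.
data1 = pd.DataFrame({
    "time": pd.to_datetime(["2022-08-01 00:00", "2022-08-01 01:00", "2022-08-01 04:00"]),
    "info": [1.0, 2.0, 5.0],
})

# Full hourly range between the first and last observation (5 timestamps).
full_range = pd.date_range(data1["time"].min(), data1["time"].max(),
                           freq=pd.Timedelta(hours=1))

# Column assignment aligns on the index labels 0, 1, 2 of data1,
# so only the first 3 of the 5 timestamps are used and the rest are dropped.
overwritten = data1.copy()
overwritten["time"] = pd.Series(full_range)
print(overwritten)
```

The frame still has 3 rows, the last timestamp is now earlier than the true end of the data, and info value 5.0 is paired with the wrong hour. That is why a merge (or a reindex) is needed rather than a column assignment.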

Time Series Range (using an 8 hr range for simplicity):

import pandas as pd
import numpy as np
s = pd.to_datetime("2022-08-01 00:00:00")
e = pd.to_datetime("2022-08-01 08:00:00")

data_res = pd.DataFrame(pd.period_range(s, e, freq = 'H'), columns = ['Time'])

DF with data (just sampling a few of the rows from the full range and adding random values as info):

actual_data = data_res.sample(n=6).sort_values(by=['Time'])
actual_data['info'] = np.random.random(size=len(actual_data))

    Time                info
0   2022-08-01 00:00    0.549414
2   2022-08-01 02:00    0.746876
3   2022-08-01 03:00    0.715491
5   2022-08-01 05:00    0.521234
6   2022-08-01 06:00    0.822393
7   2022-08-01 07:00    0.430862

You can then merge these 2 dfs on time and fill nulls from there – I’m filling with 0, but you can use whatever interpolation method is needed.

data_joined = pd.merge(data_res, actual_data, on=['Time'], how = 'left').fillna({'info': 0})

    Time                info
0   2022-08-01 00:00    0.549414
1   2022-08-01 01:00    0.000000
2   2022-08-01 02:00    0.746876
3   2022-08-01 03:00    0.715491
4   2022-08-01 04:00    0.000000
5   2022-08-01 05:00    0.521234
6   2022-08-01 06:00    0.822393
7   2022-08-01 07:00    0.430862
8   2022-08-01 08:00    0.000000
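
Since the question asks for interpolation rather than a constant fill, here is one possible variant of the same idea: put the time column on the index, reindex to the full hourly range, and interpolate the gaps. This is only a sketch under assumptions: the sample values are made up, the time column is assumed to hold unique datetime values (reindex raises an error on duplicate labels), and pd.date_range is used instead of pd.period_range so that method="time" can work on a DatetimeIndex.

```python
import pandas as pd

# Hypothetical sample: hourly readings with 02:00 and 03:00 missing.
data = pd.DataFrame({
    "time": pd.to_datetime(["2022-08-01 00:00", "2022-08-01 01:00", "2022-08-01 04:00"]),
    "info": [1.0, 2.0, 5.0],
})

# Every hour between the first and last observation.
hourly = pd.date_range(data["time"].min(), data["time"].max(),
                       freq=pd.Timedelta(hours=1))

filled = (
    data.set_index("time")
        .reindex(hourly)             # missing hours appear with NaN in "info"
        .interpolate(method="time")  # linear in elapsed time between known points
        .rename_axis("time")         # the new index has no name, so restore it
        .reset_index()
)
print(filled)
```

Each original info value stays attached to its own timestamp, and the two missing hours get values on the straight line between 2.0 and 5.0. Other interpolation methods can be swapped in through the method argument of interpolate.
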
Answered By: lesk_s