Create a new DataFrame with a new size from an old DataFrame

Question:

I have a df_train as follows:

             X1  
01-01-2020 | 1     
01-02-2020 | 2     
01-03-2020 | 3      
01-04-2020 | 4  

Now I want to build another df with a datetime index.

I will get the datetime index as:

future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')

I want a new df that starts with a copy of the df_train values and, for the remaining dates, holds the average of df_train.

Desired outcome:

               X1  
  01-05-2020 | 1     
  01-06-2020 | 2     
  01-07-2020 | 3      
  01-08-2020 | 4 
  01-09-2020 | 2.5     
  01-10-2020 | 2.5     
  01-11-2020 | 2.5      
  01-12-2020 | 2.5 
  01-01-2021 | 2.5     
  01-02-2021 | 2.5     
  01-03-2021 | 2.5      
  01-04-2021 | 2.5  
Asked By: user2512443

||

Answers:

Convert the index with to_datetime if it is not already a DatetimeIndex:

df_train.index = pd.to_datetime(df_train.index, dayfirst=True)
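
As a small illustration of why dayfirst=True matters here, the sketch below parses the question's labels three ways. The variable names are made up for the example; the labels are taken from the question and assumed to be day-month-year strings.

```python
import pandas as pd

# Index labels from the question, assumed to be day-month-year strings.
labels = ['01-01-2020', '01-02-2020', '01-03-2020', '01-04-2020']

# dayfirst=True reads the first field as the day.
day_first = pd.to_datetime(labels, dayfirst=True)

# An explicit format is an equivalent, stricter alternative.
explicit = pd.to_datetime(labels, format='%d-%m-%Y')

# Without dayfirst, the first field is read as the month.
month_first = pd.to_datetime(labels)

print(day_first.month.tolist())    # four different months
print(month_first.month.tolist())  # the same month four times
```

With the default parsing the four rows collapse into four days of January, so every later monthly offset would start from the wrong date.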

Then offset the last index value by MonthBegin and use freq='MS' (month start) instead of 'M' (month end):

future_dates = pd.date_range(
    df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
    periods=12,
    freq='MS'
)
DatetimeIndex(['2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01',
               '2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01'],
              dtype='datetime64[ns]', freq='MS')
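
To see the difference between the two frequencies, here is a minimal sketch that starts from the last training date in the question. It passes a MonthEnd offset object rather than the 'M' alias, on the assumption that the alias may not be accepted by every pandas version; the variable names are illustrative only.

```python
import pandas as pd

# Last date of the training data in the question.
last_train_date = pd.Timestamp('2020-04-01')

# Month-end frequency: the start date is rolled forward to the end of April,
# so the range begins inside the month that is already in df_train.
month_ends = pd.date_range(
    last_train_date, periods=3, freq=pd.offsets.MonthEnd()
)

# Month-start frequency, beginning one MonthBegin after the last training date.
month_starts = pd.date_range(
    last_train_date + pd.offsets.MonthBegin(1), periods=3, freq='MS'
)

print(month_ends)
print(month_starts)
```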

Then create a new frame and replace the first values based on the length of df_train:

new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = df_train['X1'].values

new_df:

             X1
2020-05-01  1.0
2020-06-01  2.0
2020-07-01  3.0
2020-08-01  4.0
2020-09-01  2.5
2020-10-01  2.5
2020-11-01  2.5
2020-12-01  2.5
2021-01-01  2.5
2021-02-01  2.5
2021-03-01  2.5
2021-04-01  2.5

Or build the column by unpacking lists:

new_df = pd.DataFrame({
    'X1': [*df_train['X1'],
           *(len(future_dates) - len(df_train)) * [df_train['X1'].mean()]]
}, index=future_dates)

new_df:

             X1
2020-05-01  1.0
2020-06-01  2.0
2020-07-01  3.0
2020-08-01  4.0
2020-09-01  2.5
2020-10-01  2.5
2020-11-01  2.5
2020-12-01  2.5
2021-01-01  2.5
2021-02-01  2.5
2021-03-01  2.5
2021-04-01  2.5
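
Both constructions assume that future_dates is at least as long as df_train. If that is not guaranteed, the steps could be wrapped in a small function with a guard. The sketch below is one possible way to do that; extend_with_mean is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def extend_with_mean(df, column, periods):
    """Hypothetical helper: copy df[column] onto the first rows of a new
    month-start index and fill the remaining rows with the column mean."""
    if periods < len(df):
        raise ValueError('periods must be at least len(df)')
    future_dates = pd.date_range(
        df.index.max() + pd.offsets.MonthBegin(1),
        periods=periods,
        freq='MS',
    )
    # Original values first, then the mean repeated for the remaining dates.
    values = list(df[column]) + [df[column].mean()] * (periods - len(df))
    return pd.DataFrame({column: values}, index=future_dates, dtype=float)


df_train = pd.DataFrame(
    {'X1': [1, 2, 3, 4]},
    index=pd.to_datetime(
        ['01-01-2020', '01-02-2020', '01-03-2020', '01-04-2020'],
        dayfirst=True,
    ),
)
result = extend_with_mean(df_train, 'X1', 12)
print(result)
```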

Then use DatetimeIndex.strftime to restore the original formatting:

new_df.index = new_df.index.strftime('%d-%m-%Y')
             X1
01-05-2020  1.0
01-06-2020  2.0
01-07-2020  3.0
01-08-2020  4.0
01-09-2020  2.5
01-10-2020  2.5
01-11-2020  2.5
01-12-2020  2.5
01-01-2021  2.5
01-02-2021  2.5
01-03-2021  2.5
01-04-2021  2.5
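
One thing to keep in mind is that strftime returns strings, so the result is a plain Index rather than a DatetimeIndex. A short sketch with illustrative variable names:

```python
import pandas as pd

idx = pd.date_range('2020-05-01', periods=3, freq='MS')

# strftime produces an Index of strings, not a DatetimeIndex.
formatted = idx.strftime('%d-%m-%Y')
print(type(formatted).__name__)
print(formatted.tolist())

# Parsing with the same format gets the timestamps back if they are needed.
restored = pd.to_datetime(formatted, format='%d-%m-%Y')
```

If further date arithmetic or resampling is planned, it may be easier to keep the DatetimeIndex and format the dates only when displaying or exporting.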

All Together:

import pandas as pd

df_train = pd.DataFrame({
    'X1': {'01-01-2020': 1, '01-02-2020': 2, '01-03-2020': 3, '01-04-2020': 4}
})

df_train.index = pd.to_datetime(df_train.index, dayfirst=True)
future_dates = pd.date_range(
    df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
    periods=12,
    freq='MS'
)
new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = \
    df_train['X1'].values
new_df.index = new_df.index.strftime('%d-%m-%Y')

print(new_df)
Answered By: Henry Ecker

  • set_index() of existing rows
  • create dataframe for new rows
  • concat() them

import io

import pandas as pd

df_train = pd.read_csv(io.StringIO("""             X1  
01-01-2020 | 1     
01-02-2020 | 2     
01-03-2020 | 3      
01-04-2020 | 4  """), sep="|")
df_train = df_train.set_index(pd.to_datetime(df_train.index,  format="%d-%m-%Y "))
df_train.columns = [c.strip() for c in df_train.columns]

future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')
pd.concat([
    df_train.set_index(future_dates[0:len(df_train)]),
    pd.DataFrame(index=future_dates[len(df_train):]).assign(X1=df_train["X1"].mean())
])
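
Note that with freq='M' and df_train.index.max() as the start, the range consists of month-end dates beginning in April 2020, which differs from the month-start dates in the question's desired output. A minimal sketch of the same set_index/concat pattern on month-start dates follows; it borrows the MonthBegin offset from the first answer, builds df_train directly instead of reading it from text, and its intermediate variable names are illustrative.

```python
import pandas as pd

df_train = pd.DataFrame(
    {'X1': [1, 2, 3, 4]},
    index=pd.to_datetime(
        ['01-01-2020', '01-02-2020', '01-03-2020', '01-04-2020'],
        dayfirst=True,
    ),
)

# Month-start dates beginning the month after the last training date.
future_dates = pd.date_range(
    df_train.index.max() + pd.offsets.MonthBegin(1), periods=12, freq='MS'
)

# Existing rows moved onto the first future dates.
copied = df_train.set_index(future_dates[:len(df_train)])

# Remaining dates filled with the mean of the training column.
filled = pd.DataFrame(index=future_dates[len(df_train):]).assign(
    X1=df_train['X1'].mean()
)

new_df = pd.concat([copied, filled])
print(new_df)
```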

Answered By: Rob Raymond

Here is another way:

df = df.reindex(pd.date_range(df.index.min(),periods=12,freq='MS'),fill_value=df['X1'].mean())

df = df.set_axis(df.index.shift(4))
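
For reference, a self-contained version of the reindex approach might look like the sketch below. It assumes the index is already a month-start DatetimeIndex, and it shifts the generated range by len(df) rather than a hard-coded 4; full_range and extended are names introduced for the example.

```python
import pandas as pd

# Training data with a month-start DatetimeIndex (assumed already converted).
df = pd.DataFrame(
    {'X1': [1, 2, 3, 4]},
    index=pd.date_range('2020-01-01', periods=4, freq='MS'),
)

# Extend to 12 rows; rows missing from df are filled with the column mean.
full_range = pd.date_range(df.index.min(), periods=12, freq='MS')
extended = df.reindex(full_range, fill_value=df['X1'].mean())

# Move every label forward by the number of training rows, so that the first
# label becomes the month after the last training date.
extended = extended.set_axis(full_range.shift(len(df)))
print(extended)
```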

Old Answer:

future_dates = pd.date_range(df.index.max(), periods=12, freq='M') + pd.tseries.offsets.MonthBegin()
df2 = pd.DataFrame(index = future_dates).assign(X1 = pd.Series(df['X1'].to_numpy(),index=future_dates[0:4])).fillna(df.mean())

Output:

             X1
2020-05-01  1.0
2020-06-01  2.0
2020-07-01  3.0
2020-08-01  4.0
2020-09-01  2.5
2020-10-01  2.5
2020-11-01  2.5
2020-12-01  2.5
2021-01-01  2.5
2021-02-01  2.5
2021-03-01  2.5
2021-04-01  2.5
Answered By: rhug123