Create a new DataFrame with a new size from an old DataFrame
Question:
I have a df_train as follows:
X1
01-01-2020 | 1
01-02-2020 | 2
01-03-2020 | 3
01-04-2020 | 4
Now I want to build another df with a datetime index.
I get the datetime index as:
future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')
I want a new df that starts with a copy of df_train; the remaining dates should be filled with the average of df_train.
Desired outcome:
X1
01-05-2020 | 1
01-06-2020 | 2
01-07-2020 | 3
01-08-2020 | 4
01-09-2020 | 2.5
01-10-2020 | 2.5
01-11-2020 | 2.5
01-12-2020 | 2.5
01-01-2021 | 2.5
01-02-2021 | 2.5
01-03-2021 | 2.5
01-04-2021 | 2.5
Answers:
Convert the index with to_datetime if it is not already a DatetimeIndex:
df_train.index = pd.to_datetime(df_train.index, dayfirst=True)
Then offset the index maximum by MonthBegin and use freq='MS' instead of 'M':
future_dates = pd.date_range(
df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
periods=12,
freq='MS'
)
DatetimeIndex(['2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
'2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01',
'2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01'],
dtype='datetime64[ns]', freq='MS')
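To make the offset step concrete, here is a minimal sketch of what MonthBegin(1) and freq='MS' do to the example's last date. The variable names last, start and mid_month are only illustrative, and the hard-coded timestamp stands in for df_train.index.max().

```python
import pandas as pd

# Stand-in for df_train.index.max() in the example above.
last = pd.Timestamp("2020-04-01")

# Adding MonthBegin(1) moves to the first day of the following month.
start = last + pd.offsets.MonthBegin(1)

# 'MS' (month start) keeps every generated date on the 1st of a month.
future_dates = pd.date_range(start, periods=12, freq="MS")

# A mid-month date also rolls forward to the next month start.
mid_month = pd.Timestamp("2020-04-15") + pd.offsets.MonthBegin(1)

print(start.date(), future_dates[-1].date(), mid_month.date())
```

Because the start is already on a month boundary, the twelve dates run from the month after the last training row to one year after it.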
Then create a new frame filled with the mean and overwrite the first rows, based on the length of df_train:
new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = df_train['X1'].values
new_df:
X1
2020-05-01 1.0
2020-06-01 2.0
2020-07-01 3.0
2020-08-01 4.0
2020-09-01 2.5
2020-10-01 2.5
2020-11-01 2.5
2020-12-01 2.5
2021-01-01 2.5
2021-02-01 2.5
2021-03-01 2.5
2021-04-01 2.5
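If df_train had more than one column, the same idea could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original answer: the helper name extend_with_mean and the second column X2 are hypothetical, and it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd


def extend_with_mean(df, periods=12):
    """Hypothetical helper: copy df's rows onto the next `periods` month
    starts and pad the remaining rows with each column's mean."""
    n_extra = periods - len(df)
    if n_extra < 0:
        raise ValueError("periods must be at least len(df)")
    future = pd.date_range(
        df.index.max() + pd.offsets.MonthBegin(1), periods=periods, freq="MS"
    )
    # One row of column means, repeated for every padded date.
    padding = np.tile(df.mean().to_numpy(dtype=float), (n_extra, 1))
    values = np.vstack([df.to_numpy(dtype=float), padding])
    return pd.DataFrame(values, index=future, columns=df.columns)


df_train = pd.DataFrame(
    {"X1": [1, 2, 3, 4], "X2": [10, 20, 30, 40]},  # X2 is made up for illustration
    index=pd.to_datetime(
        ["01-01-2020", "01-02-2020", "01-03-2020", "01-04-2020"], format="%d-%m-%Y"
    ),
)
new_df = extend_with_mean(df_train)
print(new_df)
```

Stacking the values with NumPy avoids positional assignment, at the cost of forcing every column to float.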
Or build the column from a list (the original values followed by the repeated mean):
new_df = pd.DataFrame({
'X1': [*df_train['X1'],
*(len(future_dates) - len(df_train)) * [df_train['X1'].mean()]]
}, index=future_dates)
new_df:
X1
2020-05-01 1.0
2020-06-01 2.0
2020-07-01 3.0
2020-08-01 4.0
2020-09-01 2.5
2020-10-01 2.5
2020-11-01 2.5
2020-12-01 2.5
2021-01-01 2.5
2021-02-01 2.5
2021-03-01 2.5
2021-04-01 2.5
Then use DatetimeIndex.strftime to restore the original formatting:
new_df.index = new_df.index.strftime('%d-%m-%Y')
X1
01-05-2020 1.0
01-06-2020 2.0
01-07-2020 3.0
01-08-2020 4.0
01-09-2020 2.5
01-10-2020 2.5
01-11-2020 2.5
01-12-2020 2.5
01-01-2021 2.5
01-02-2021 2.5
01-03-2021 2.5
01-04-2021 2.5
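One thing to keep in mind with the strftime step is that the result is an index of plain strings, so date-based operations no longer apply to it. A small sketch of the round trip follows; the names labels and restored are only illustrative.

```python
import pandas as pd

dates = pd.date_range("2020-05-01", periods=3, freq="MS")

# strftime returns an Index of strings, not a DatetimeIndex.
labels = dates.strftime("%d-%m-%Y")
print(list(labels))  # ['01-05-2020', '01-06-2020', '01-07-2020']
print(isinstance(labels, pd.DatetimeIndex))  # False

# Parsing with the same format string recovers the dates.
restored = pd.to_datetime(labels, format="%d-%m-%Y")
print(isinstance(restored, pd.DatetimeIndex))  # True
```

If the frame is still needed for resampling or date slicing, it may be simpler to format the index only when printing or exporting.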
All Together:
import pandas as pd
df_train = pd.DataFrame({
'X1': {'01-01-2020': 1, '01-02-2020': 2, '01-03-2020': 3, '01-04-2020': 4}
})
df_train.index = pd.to_datetime(df_train.index, dayfirst=True)
future_dates = pd.date_range(
df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
periods=12,
freq='MS'
)
new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = df_train['X1'].values
new_df.index = new_df.index.strftime('%d-%m-%Y')
print(new_df)
- set_index() of existing rows
- create dataframe for new rows
- concat() them
import io
import pandas as pd
df_train = pd.read_csv(io.StringIO(""" X1
01-01-2020 | 1
01-02-2020 | 2
01-03-2020 | 3
01-04-2020 | 4 """), sep="|")
df_train = df_train.set_index(pd.to_datetime(df_train.index, format="%d-%m-%Y "))
df_train.columns = [c.strip() for c in df_train.columns]
future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')
pd.concat([
df_train.set_index(future_dates[0:len(df_train)]),
pd.DataFrame(index=future_dates[len(df_train):]).assign(X1=df_train["X1"].mean())
])
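Note that with freq='M' and a start of 2020-04-01 the generated dates are month ends (beginning at 2020-04-30), which differs from the first-of-month dates in the desired outcome. A hedged variant of the same concat approach, assuming month-start dates are wanted, could look like this; the names existing, padding and result are only illustrative.

```python
import pandas as pd

df_train = pd.DataFrame(
    {"X1": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["01-01-2020", "01-02-2020", "01-03-2020", "01-04-2020"], format="%d-%m-%Y"
    ),
)

# Month-start dates beginning the month after the last training row.
future_dates = pd.date_range(
    df_train.index.max() + pd.offsets.MonthBegin(1), periods=12, freq="MS"
)

n = len(df_train)
# Existing rows, relabelled with the first n future dates.
existing = df_train.set_index(future_dates[:n])
# New rows, all holding the mean of X1.
padding = pd.DataFrame({"X1": df_train["X1"].mean()}, index=future_dates[n:])

result = pd.concat([existing, padding])
print(result)
```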
Here is another way:
df = df.reindex(pd.date_range(df.index.min(),periods=12,freq='MS'),fill_value=df['X1'].mean())
df = df.set_axis(df.index.shift(4))
Old Answer:
future_dates = pd.date_range(df.index.max(), periods=12, freq='M') + pd.tseries.offsets.MonthBegin()
df2 = pd.DataFrame(index = future_dates).assign(X1 = pd.Series(df['X1'].to_numpy(),index=future_dates[0:4])).fillna(df.mean())
Output:
X1
2020-05-01 1.0
2020-06-01 2.0
2020-07-01 3.0
2020-08-01 4.0
2020-09-01 2.5
2020-10-01 2.5
2020-11-01 2.5
2020-12-01 2.5
2021-01-01 2.5
2021-02-01 2.5
2021-03-01 2.5
2021-04-01 2.5
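The reindex-and-shift answer above is compact, so here is a step-by-step sketch of what it does. It is a slight variation rather than the answer's exact code: the shifted index is computed from the date_range object directly, and the names target and extended are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"X1": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["01-01-2020", "01-02-2020", "01-03-2020", "01-04-2020"], format="%d-%m-%Y"
    ),
)
n = len(df)
mean = df["X1"].mean()

# Step 1: twelve month starts beginning at the first existing date.
target = pd.date_range(df.index.min(), periods=12, freq="MS")

# Step 2: reindex keeps the rows that already exist and fills the
# missing dates with the mean.
extended = df.reindex(target, fill_value=mean)

# Step 3: move every label forward by n months so the copy starts
# right after the original data ends.
extended.index = target.shift(n)
print(extended)
```

Shifting by len(df) instead of a literal 4 keeps the sketch valid if the training frame has a different number of monthly rows.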