Create Multi-Index empty DataFrame to join with main DataFrame
Question:
Suppose that I have a dataframe which can be created using code below
df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-05'],
'product':['A', 'A', 'A', 'B', 'B', 'B'],
'price':[10, 20, 30, 40, 50, 60]
}
)
df['date'] = pd.to_datetime(df['date'])
I want to create an empty dataframe let’s say main_df
which will contain all dates between df.date.min()
and df.date.max()
for each product
and on days where values in nan
I want to ffill
and bfill
for remaning. The resulting dataframe would be as below:
+------------+---------+-------+
| date | product | price |
+------------+---------+-------+
| 2021-01-01 | A | 10 |
| 2021-01-02 | A | 20 |
| 2021-01-03 | A | 20 |
| 2021-01-04 | A | 20 |
| 2021-01-05 | A | 30 |
| 2021-01-01 | B | 40 |
| 2021-01-02 | B | 40 |
| 2021-01-03 | B | 50 |
| 2021-01-04 | B | 50 |
| 2021-01-05 | B | 60 |
+------------+---------+-------+
Answers:
Using resample
df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-06'],
'product':['A', 'A', 'A', 'B', 'B', 'B'],
'price':[10, 20, 30, 40, 50, 60]
}
)
df['date'] = pd.to_datetime(df['date'])
df
# Out:
# date product price
# 0 2021-01-01 A 10
# 1 2021-01-02 A 20
# 2 2021-01-05 A 30
# 3 2021-01-02 B 40
# 4 2021-01-03 B 50
# 5 2021-01-06 B 60
df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out:
# product date price
# 0 A 2021-01-01 10
# 1 A 2021-01-02 20
# 2 A 2021-01-03 20
# 3 A 2021-01-04 20
# 4 A 2021-01-05 30
# 5 B 2021-01-02 40
# 6 B 2021-01-03 50
# 7 B 2021-01-04 50
# 8 B 2021-01-05 50
# 9 B 2021-01-06 60
See the rows that have been filled by ffill
:
df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out:
# product date
# A 2021-01-01 10.0
# 2021-01-02 20.0
# 2021-01-03 NaN
# 2021-01-04 NaN
# 2021-01-05 30.0
# B 2021-01-02 40.0
# 2021-01-03 50.0
# 2021-01-04 NaN
# 2021-01-05 NaN
# 2021-01-06 60.0
# Name: price, dtype: float64
Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
First
make pivot table, upsampling by asfreq
and fill null
df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
output:
product A B
date
2021-01-01 10.0 40.0
2021-01-02 20.0 40.0
2021-01-03 20.0 50.0
2021-01-04 20.0 50.0
2021-01-05 30.0 60.0
Second
stack
result and so on (include full code)
(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
.stack().reset_index().rename(columns={0:'price'})
.sort_values('product').reset_index(drop=True))
output:
date product price
0 2021-01-01 A 10.0
1 2021-01-02 A 20.0
2 2021-01-03 A 20.0
3 2021-01-04 A 20.0
4 2021-01-05 A 30.0
5 2021-01-01 B 40.0
6 2021-01-02 B 40.0
7 2021-01-03 B 50.0
8 2021-01-04 B 50.0
9 2021-01-05 B 60.0
Suppose that I have a dataframe which can be created using code below
df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-05'],
'product':['A', 'A', 'A', 'B', 'B', 'B'],
'price':[10, 20, 30, 40, 50, 60]
}
)
df['date'] = pd.to_datetime(df['date'])
I want to create an empty dataframe let’s say main_df
which will contain all dates between df.date.min()
and df.date.max()
for each product
and on days where values in nan
I want to ffill
and bfill
for remaning. The resulting dataframe would be as below:
+------------+---------+-------+
| date | product | price |
+------------+---------+-------+
| 2021-01-01 | A | 10 |
| 2021-01-02 | A | 20 |
| 2021-01-03 | A | 20 |
| 2021-01-04 | A | 20 |
| 2021-01-05 | A | 30 |
| 2021-01-01 | B | 40 |
| 2021-01-02 | B | 40 |
| 2021-01-03 | B | 50 |
| 2021-01-04 | B | 50 |
| 2021-01-05 | B | 60 |
+------------+---------+-------+
Using resample
df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-06'],
'product':['A', 'A', 'A', 'B', 'B', 'B'],
'price':[10, 20, 30, 40, 50, 60]
}
)
df['date'] = pd.to_datetime(df['date'])
df
# Out:
# date product price
# 0 2021-01-01 A 10
# 1 2021-01-02 A 20
# 2 2021-01-05 A 30
# 3 2021-01-02 B 40
# 4 2021-01-03 B 50
# 5 2021-01-06 B 60
df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out:
# product date price
# 0 A 2021-01-01 10
# 1 A 2021-01-02 20
# 2 A 2021-01-03 20
# 3 A 2021-01-04 20
# 4 A 2021-01-05 30
# 5 B 2021-01-02 40
# 6 B 2021-01-03 50
# 7 B 2021-01-04 50
# 8 B 2021-01-05 50
# 9 B 2021-01-06 60
See the rows that have been filled by ffill
:
df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out:
# product date
# A 2021-01-01 10.0
# 2021-01-02 20.0
# 2021-01-03 NaN
# 2021-01-04 NaN
# 2021-01-05 30.0
# B 2021-01-02 40.0
# 2021-01-03 50.0
# 2021-01-04 NaN
# 2021-01-05 NaN
# 2021-01-06 60.0
# Name: price, dtype: float64
Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
First
make pivot table, upsampling by asfreq
and fill null
df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
output:
product A B
date
2021-01-01 10.0 40.0
2021-01-02 20.0 40.0
2021-01-03 20.0 50.0
2021-01-04 20.0 50.0
2021-01-05 30.0 60.0
Second
stack
result and so on (include full code)
(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
.stack().reset_index().rename(columns={0:'price'})
.sort_values('product').reset_index(drop=True))
output:
date product price
0 2021-01-01 A 10.0
1 2021-01-02 A 20.0
2 2021-01-03 A 20.0
3 2021-01-04 A 20.0
4 2021-01-05 A 30.0
5 2021-01-01 B 40.0
6 2021-01-02 B 40.0
7 2021-01-03 B 50.0
8 2021-01-04 B 50.0
9 2021-01-05 B 60.0