Expanding pandas data frame with date range in columns
Question:
I have a pandas dataframe with dates and strings similar to this:
Start End Note Item
2016-10-22 2016-11-05 Z A
2017-02-11 2017-02-25 W B
I need to expand/transform it to the below, filling in weeks (W-SAT) between the Start and End columns and forward filling the data in Note and Item:
Start Note Item
2016-10-22 Z A
2016-10-29 Z A
2016-11-05 Z A
2017-02-11 W B
2017-02-18 W B
2017-02-25 W B
What’s the best way to do this with pandas? Some sort of multi-index apply?
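For reference, the input frame can be reproduced with something like the following (the variable name df is an assumption carried through the answers below):
import pandas as pd

# Reconstruct the example input frame from the question (dates parsed as datetimes).
df = pd.DataFrame({'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
                   'End': pd.to_datetime(['2016-11-05', '2017-02-25']),
                   'Note': ['Z', 'W'],
                   'Item': ['A', 'B']})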
Answers:
You can iterate over each row, create a new dataframe for each one, and then concatenate them together:
pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
Start Note Item
0 2016-10-22 Z A
1 2016-10-29 Z A
2 2016-11-05 Z A
3 2017-02-11 W B
4 2017-02-18 W B
5 2017-02-25 W B
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
import numpy as np
import pandas as pd

def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)
    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]
    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())
    # for each possible timediff, get the intermediate time-differences
    # (end_inclusive acts as 0/1 here, so True keeps the end date itself):
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])
    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')
    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']
    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)
    # don't modify dataframe in place:
    del dataframe['_dt_diff']
    return data_expanded
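A minimal usage sketch on the question's frame (my own illustration, not part of the original answer): reusing 'Start' as new_colname overwrites the start column with the expanded dates, and time_unit='W' works here because both ranges start on a Saturday and span whole weeks.
# Hypothetical call on the question's df defined above:
expanded = date_expander(df,
                         start_dt_colname='Start',
                         end_dt_colname='End',
                         time_unit='W',
                         new_colname='Start',
                         end_inclusive=True)
# expanded should now hold one row per week with columns Start, Note, Item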
You don’t need iteration at all.
# stack Start/End into a single 'date' column, then resample weekly within each Note group
df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
df = (df_start_end
      .groupby('Note')
      .apply(lambda x: x.set_index('date').resample('W').pad())
      .drop(columns=['Note', 'variable'])
      .reset_index())
So I recently spent a bit of time trying to figure out an efficient pandas-based approach to this issue (which is very trivial with data.table in R) and wanted to share the approach I came up with here:
df.set_index("Note").apply(
    lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1
).explode()
Note: using .values makes a big difference in performance!
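If you also want to carry the Item column and end up with a flat frame like the one in the question, one variation (my own follow-up sketch, not part of the original answer) is to index by both label columns before exploding:
# Variation: keep Note and Item, then flatten back to the desired output shape
expanded = (df.set_index(['Note', 'Item'])
              .apply(lambda row: pd.date_range(row['Start'], row['End'], freq='W-SAT').values, axis=1)
              .explode()
              .rename('Start')
              .reset_index()[['Start', 'Note', 'Item']])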
There are quite a few solutions here already and I wanted to compare the speed for different numbers of rows and periods – see results (in seconds) below:
- n_rows is the number of initial rows and n_periods is the number of periods per row, i.e. the window size: the combinations below always result in 1 million rows when expanded
- the other columns are named after the posters of the solutions
- note I made a slight tweak to Gen's approach whereby, after pd.melt(), I do df.set_index("date").groupby("Note").resample("W-SAT").ffill() – I labelled this Gen2 and it seems to perform slightly better and gives the same result
- each n_rows, n_periods combination was run 10 times and results were then averaged
Anyway, jwdink's solution looks like a winner when there are many rows and few periods, whereas my solution seems to do better on the other end of the spectrum, though only marginally ahead of the others as the number of rows decreases:
| n_rows | n_periods | jwdink | TedPetrou | Gen   | Gen2 | robbie |
|--------|-----------|--------|-----------|-------|------|--------|
| 250    | 4000      | 6.63   | 0.33      | 0.64  | 0.45 | 0.28   |
| 500    | 2000      | 3.21   | 0.65      | 1.18  | 0.81 | 0.34   |
| 1000   | 1000      | 1.57   | 1.28      | 2.30  | 1.60 | 0.48   |
| 2000   | 500       | 0.83   | 2.57      | 4.68  | 3.24 | 0.71   |
| 5000   | 200       | 0.40   | 6.10      | 13.26 | 9.59 | 1.43   |
If you want to run your own tests on this, my code is available in my GitHub repo – note I created a DateExpander class that wraps all the functions to make it easier to scale the simulation.
Also, for reference, I used a 2-core STANDARD_DS11_V2 Azure VM – only for about 10 minutes, so this is literally me giving my 2 cents on the issue!