Expanding pandas data frame with date range in columns

Question:

I have a pandas dataframe with dates and strings similar to this:

Start        End           Note    Item
2016-10-22   2016-11-05    Z       A
2017-02-11   2017-02-25    W       B

I need to expand/transform it to the below, filling in weeks (W-SAT) between the Start and End columns and forward-filling the data in Note and Item:

Start        Note    Item
2016-10-22   Z       A
2016-10-29   Z       A
2016-11-05   Z       A
2017-02-11   W       B
2017-02-18   W       B
2017-02-25   W       B

What’s the best way to do this with pandas? Some sort of multi-index apply?
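
For reference, the example frame can be built like this (assuming Start and End are parsed as datetime columns):

import pandas as pd

df = pd.DataFrame({'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
                   'End': pd.to_datetime(['2016-11-05', '2017-02-25']),
                   'Note': ['Z', 'W'],
                   'Item': ['A', 'B']})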

Asked By: claybot


Answers:

You can iterate over each row, create a new DataFrame for each one, and then concatenate them together:

pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
               'Note': row.Note,
               'Item': row.Item}, columns=['Start', 'Note', 'Item']) 
           for i, row in df.iterrows()], ignore_index=True)

       Start Note Item
0 2016-10-22    Z    A
1 2016-10-29    Z    A
2 2016-11-05    Z    A
3 2017-02-11    W    B
4 2017-02-18    W    B
5 2017-02-25    W    B
Answered By: Ted Petrou

If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:

import numpy as np
import pandas as pd

def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)

    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]

    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())

    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])

    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')

    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']

    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)

    # clean up the temporary column so the caller's dataframe isn't left modified:
    del dataframe['_dt_diff']

    return data_expanded
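
For the example data, where each Start/End gap is a whole number of weeks, a call along the following lines should reproduce the desired output (the argument values here are an assumption, not spelled out in the answer):

date_expander(df, start_dt_colname='Start', end_dt_colname='End',
              time_unit='W', new_colname='Start', end_inclusive=True)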
Answered By: jwdink

You don’t need iteration at all.

df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')

df = (df_start_end.groupby('Note')
                  .apply(lambda x: x.set_index('date').resample('W').pad())
                  .drop(columns=['Note', 'variable'])
                  .reset_index())
Answered By: Gen

So I recently spent a bit of time trying to figure out an efficient pandas-based approach to this issue (which is very trivial with data.table in R) and wanted to share the approach I came up with here:

df.set_index("Note").apply(
    lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1
).explode()

Note: using .values makes a big difference in performance!
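
If you also need to keep the Item column in the output, one possible variation of the same idea (not shown in the snippet above) is to put both identifier columns in the index and reset it afterwards:

df.set_index(['Note', 'Item']).apply(
    lambda row: pd.date_range(row['Start'], row['End'], freq='W-SAT').values, axis=1
).explode().rename('Start').reset_index()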

There are quite a few solutions here already and I wanted to compare the speed for different numbers of rows and periods – see results (in seconds) below:

  • n_rows is the number of initial rows and n_periods is the number of periods per row, i.e. the window size: the combinations below always result in 1 million rows when expanded
  • the other columns are named after the posters of the solutions
  • note I made a slight tweak to Gen’s approach whereby, after pd.melt(), I do df.set_index("date").groupby("Note").resample("W-SAT").ffill() – I labelled this Gen2 and it seems to perform slightly better and gives the same result (see the sketch after this list)
  • each n_rows, n_periods combination was run 10 times and the results were then averaged
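
A sketch of that Gen2 variant, assuming the same df as in the question (it mirrors Gen's pipeline but anchors on W-SAT and uses ffill):

df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
gen2 = (df_start_end.set_index('date')
                    .groupby('Note')
                    .resample('W-SAT')
                    .ffill()
                    .drop(columns=['Note', 'variable'])
                    .reset_index())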

Anyway, jwdink’s solution looks like a winner when there are many rows and few periods, whereas my solution seems to do better on the other end of the spectrum, though only marginally ahead of the others as the number of rows decreases:

n_rows  n_periods  jwdink  TedPetrou    Gen  Gen2  robbie
   250       4000    6.63       0.33   0.64  0.45    0.28
   500       2000    3.21       0.65   1.18  0.81    0.34
  1000       1000    1.57       1.28   2.30  1.60    0.48
  2000        500    0.83       2.57   4.68  3.24    0.71
  5000        200    0.40       6.10  13.26  9.59    1.43

If you want to run your own tests on this, my code is available in my GitHub repo – note I created a DateExpander class object that wraps all the functions to make it easier to scale the simulation.

Also, for reference, I used a 2-core STANDARD_DS11_V2 Azure VM – only for about 10 minutes, so this is literally me giving my 2 cents on the issue!

Answered By: robbie