How to add calculated rows below each row in a pandas DataFrame

Question:

I have a dataframe_1 as such:

Index   Time          Label
0       0.000 ns      Segment 1
1       2.749 sec     baseline
2       3.459 min     begin test
3       7.009 min     end of test

I would like to add multiple new rows between each of dataframe_1's rows, where the Time of each new row adds one more minute until reaching the time of dataframe_1's next row (and its corresponding Label). For example, the above table should ultimately look like this:

Index     Time               Label
0         0.000 ns           Segment 1
1         2.749 sec          baseline
2         00:01:02.749000    baseline + 1min
3         00:02:02.749000    baseline + 2min
4         00:03:02.749000    baseline + 3min
5         3.459 min          begin test
6         00:04:27.540000    begin test + 1min
7         00:05:27.540000    begin test + 2min
8         00:06:27.540000    begin test + 3min
9         7.009 min          end of test

Using Timedelta type via pd.to_timedelta() is perfectly fine.

I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, add rows for each added minute, and then concatenate the dataframes back together. However, I am unsure how to accomplish this.

Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?

I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:

    baseline_row = df_legend[df_legend['Label'] == 'baseline']
    [baseline_index] = baseline_row.index
    baseline_time = baseline_row['Time']

    interval_mins = 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

    cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
    cutoff_time = pd.to_timedelta(cutoff_time_np)
    
    while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):

        new_row = baseline_row.copy()
        new_row['Label'] = f'minute {interval_mins}'
        new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
        new_row.index = [baseline_index + interval_mins - 0.5]

        df_legend = df_legend.append(new_row, ignore_index=False)
        df_legend = df_legend.sort_index().reset_index(drop=True)
        pdb.set_trace()

        interval_mins += 1
        new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting it back together. I’m just not sure what the best way to do that is, especially since pandas is apparently very slow when iterating over rows.

I would really appreciate some guidance.

Asked By: Raj

Answers:

This might be faster than your solution.

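import numpy as np
import pandas as pd
df['Time'] = pd.to_timedelta(df['Time'])
# Gap to the previous row, in minutes
df['counts'] = df['Time'].diff().dt.total_seconds() / 60
# Shift up: whole minutes until the next row (0 for the last row)
df['counts'] = np.floor(df['counts'].shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)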

df

             Time        Label  counts
0        00:00:00    Segment 1       0
1 00:00:02.749000     baseline       3
2 00:03:27.540000   begin test       3
3 00:07:00.540000  end of test       0

Then use iterrows to get your desired output.

new_df = []
for _, row in df.iterrows():
    # Always keep the original row
    new_df.append(row)
    # Add one copy per whole minute that fits before the next row
    for i in range(1, row.counts + 1):
        new_row = row.copy()
        new_row.Time += i * pd.Timedelta('1 min')
        new_row.Label = f'{row.Label} + {i}min'
        new_df.append(new_row)

new_df = pd.DataFrame(new_df)
new_df

             Time              Label  counts
0        00:00:00          Segment 1       0
1 00:00:02.749000           baseline       3
1 00:01:02.749000    baseline + 1min       3
1 00:02:02.749000    baseline + 2min       3
1 00:03:02.749000    baseline + 3min       3
2 00:03:27.540000         begin test       3
2 00:04:27.540000  begin test + 1min       3
2 00:05:27.540000  begin test + 2min       3
2 00:06:27.540000  begin test + 3min       3
3 00:07:00.540000        end of test       0
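
For larger frames the Python-level loop can be avoided. The following is a minimal sketch of a loop-free variant, not the approach shown above: it swaps iterrows for Index.repeat and groupby(...).cumcount(). The sample frame is rebuilt by hand (times given as integer milliseconds, an assumption made only to keep the example self-contained), and the names gap, counts, step and out are illustrative.

```python
import pandas as pd

# Sample data rebuilt by hand; times are integer milliseconds (assumption).
df = pd.DataFrame({
    "Time": pd.to_timedelta([0, 2749, 207540, 420540], unit="ms"),
    "Label": ["Segment 1", "baseline", "begin test", "end of test"],
})

# Whole minutes that fit before the next row's time (0 for the last row).
gap = df["Time"].shift(-1) - df["Time"]
counts = (gap // pd.Timedelta(minutes=1)).fillna(0).astype(int)

# Repeat each row (1 + counts) times, then number the copies of each row 0, 1, 2, ...
out = df.loc[df.index.repeat((counts + 1).to_numpy())]
step = out.groupby(level=0).cumcount().reset_index(drop=True)
out = out.reset_index(drop=True)

# Copy number k adds k minutes and a " + kmin" suffix; k == 0 is the original row.
out["Time"] = out["Time"] + pd.to_timedelta(step, unit="min")
suffix = " + " + step.astype(str) + "min"
out["Label"] = out["Label"].where(step == 0, out["Label"] + suffix)

print(out)
```

As in the loop version, a gap that is an exact multiple of one minute would produce a last inserted row with the same time as the next original row; whether that is wanted depends on the use case.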
Answered By: ResidentSleeper

I assume that you converted the Time column from the "number unit" format to a
string representation of the time, something like:

               Time        Label
Index                           
0      00:00:00.000    Segment 1
1      00:00:02.749     baseline
2      00:03:27.540   begin test
3      00:07:00.540  end of test

Then, to get your result:

  1. Compute timNxt – the Time column shifted by 1 position and converted
    to datetime:

    timNxt = pd.to_datetime(df.Time.shift(-1))
    
  2. Define the following "replication" function:

    def myRepl(row):
        timCurr = pd.to_datetime(row.Time)
        timNext = timNxt[row.name]
        tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
        if pd.notna(timNext):
            n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
            tbl.extend([ [(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                row.Label + f' + {i}min'] for i in range(1, n)])
        return pd.DataFrame(tbl, columns=row.index)
    
  3. Apply it to each row of your df and concatenate results:

    result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
    

The result is:

              Time              Label
0  00:00:00.000000          Segment 1
1  00:00:02.749000           baseline
2  00:01:02.749000    baseline + 1min
3  00:02:02.749000    baseline + 2min
4  00:03:02.749000    baseline + 3min
5  00:03:27.540000         begin test
6  00:04:27.540000  begin test + 1min
7  00:05:27.540000  begin test + 2min
8  00:06:27.540000  begin test + 3min
9  00:07:00.540000        end of test

The Time column of the resulting DataFrame is still a string, but at
least the fractional seconds have 6 digits in every row.
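
If Timedelta values are preferred over strings (the question says pd.to_timedelta() is fine), the string column can be converted afterwards. A minimal sketch, using a hand-built two-row stand-in for the result frame (the stand-in data is an assumption for illustration):

```python
import pandas as pd

# Two-row stand-in for the string-typed result (illustrative data).
result = pd.DataFrame({
    "Time": ["00:00:02.749000", "00:01:02.749000"],
    "Label": ["baseline", "baseline + 1min"],
})

# "HH:MM:SS.ffffff" strings parse directly into Timedelta values.
result["Time"] = pd.to_timedelta(result["Time"])
print(result.dtypes)
```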

Answered By: Valdi_Bo