How to add calculated rows below each row in a pandas DataFrame
Question:
I have a dataframe_1 as such:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 3.459 min begin test
3 7.009 min end of test
And I would like to add multiple new rows in between each of dataframe_1’s rows, where the Time column for each new row would add an additional minute until reaching dataframe_1’s next row’s time (and corresponding Label). For example, the above table should ultimately look like this:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 3.459 min begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 7.009 min end of test
Using the Timedelta type via pd.to_timedelta() is perfectly fine.
I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, add rows for each added minute, and then concat the dataframes back together. However, I am unsure of how to accomplish this.
Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?
I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:
baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)
while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
    new_row = baseline_row.copy()
    new_row['Label'] = f'minute {interval_mins}'
    new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
    new_row.index = [baseline_index + interval_mins - 0.5]
    df_legend = df_legend.append(new_row, ignore_index=False)
    df_legend = df_legend.sort_index().reset_index(drop=True)
    pdb.set_trace()
    interval_mins += 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting it back together. I’m just not sure what the best way to do that is, especially since pandas is apparently very slow when iterating over rows.
I would really appreciate some guidance.
Answers:
This might be faster than your solution.
df.Time = pd.to_timedelta(df.Time)
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
2 00:03:27.540000 begin test 3
3 00:07:00.540000 end of test 0
Then use iterrows to get your desired output.
new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
1 00:01:02.749000 baseline + 1min 3
1 00:02:02.749000 baseline + 2min 3
1 00:03:02.749000 baseline + 3min 3
2 00:03:27.540000 begin test 3
2 00:04:27.540000 begin test + 1min 3
2 00:05:27.540000 begin test + 2min 3
2 00:06:27.540000 begin test + 3min 3
3 00:07:00.540000 end of test 0
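If iterrows turns out to be slow on a larger frame, the same counts idea can probably be done without a Python-level loop over rows. Below is a minimal sketch, assuming Time has already been converted to Timedelta. The sample frame and the names gap_seconds, counts, step and out are only illustrative and are not part of the answer above.

```python
import pandas as pd

# Sample data mirroring the question, with Time already parsed to Timedelta.
df = pd.DataFrame({
    "Time": pd.to_timedelta(
        ["00:00:00", "00:00:02.749", "00:03:27.540", "00:07:00.540"]
    ),
    "Label": ["Segment 1", "baseline", "begin test", "end of test"],
})

# Whole minutes that fit before the next row's time (0 for the last row).
gap_seconds = df["Time"].diff().shift(-1).dt.total_seconds()
counts = (gap_seconds // 60).fillna(0).astype(int)

# Repeat each row (1 + counts) times; the copies keep the original index label.
out = df.loc[df.index.repeat((counts + 1).to_numpy())]

# Number the copies within each original row: 0 is the original, 1..n are added.
step = out.groupby(level=0).cumcount().reset_index(drop=True)
out = out.reset_index(drop=True)

# Shift each copy by its step in minutes and tag the added rows in the label.
out["Time"] = out["Time"] + pd.to_timedelta(step, unit="min")
out["Label"] = out["Label"].where(
    step == 0, out["Label"] + " + " + step.astype(str) + "min"
)
print(out)
```

The row repetition and the per-group numbering are both done by pandas, so the cost no longer grows with a Python loop per inserted row. Whether that matters depends on how many rows you actually have.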
I assume that you converted the Time column from "number unit" format to a string
representation of the time. Something like:
Time Label
Index
0 00:00:00.000 Segment 1
1 00:00:02.749 baseline
2 00:03:27.540 begin test
3 00:07:00.540 end of test
Then, to get your result:
- Compute timNxt – the Time column shifted by 1 position and converted to datetime:
timNxt = pd.to_datetime(df.Time.shift(-1))
- Define the following "replication" function:
def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
        n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
        tbl.extend([[(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                     row.Label + f' + {i}min'] for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
- Apply it to each row of your df and concatenate results:
result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
The result is:
Time Label
0 00:00:00.000000 Segment 1
1 00:00:02.749000 baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 00:03:27.540000 begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 00:07:00.540000 end of test
The resulting DataFrame also has the Time column as strings, but at
least the fractional part of the seconds has 6 digits everywhere.
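If you would rather keep Time as a Timedelta instead of a string, the same per-row "build a small frame, then concat" idea could be sketched roughly like this. It assumes Time was converted with pd.to_timedelta beforehand. The helper name expand and the sample frame are hypothetical and are not part of the answer above.

```python
import pandas as pd

# Sample data mirroring the question, with Time already parsed to Timedelta.
df = pd.DataFrame({
    "Time": pd.to_timedelta(
        ["00:00:00", "00:00:02.749", "00:03:27.540", "00:07:00.540"]
    ),
    "Label": ["Segment 1", "baseline", "begin test", "end of test"],
})


def expand(curr, nxt, label):
    """Hypothetical helper: one original row plus its '+ N min' rows."""
    # Whole minutes that fit before the next row; none for the last row (NaT).
    n = 0 if pd.isna(nxt) else (nxt - curr) // pd.Timedelta(minutes=1)
    steps = range(n + 1)
    return pd.DataFrame({
        "Time": [curr + pd.Timedelta(minutes=i) for i in steps],
        "Label": [label if i == 0 else f"{label} + {i}min" for i in steps],
    })


# Pair every row with the time of the row that follows it.
pieces = [
    expand(curr, nxt, label)
    for curr, nxt, label in zip(df["Time"], df["Time"].shift(-1), df["Label"])
]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Here the Time column stays a timedelta64 column, so later arithmetic or comparisons on it work without parsing strings again.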