Sum n consecutive entries in a sequence
Question:
apologies if the terminology isn’t right, I’m not quite sure how to ask the question accurately.
I would like to summarise my data, which for the purposes of a MWE consists of the time (in seconds) since starting the measurement (called ‘Start scan’ here), and then the total value (called ‘Sum.’ here). I have filtered the data down to those measurements above a certain cutoff. As the measurements rise and fall with a delay, each episode above the cutoff results in a few rows of data. I would like to reduce this down, so that each separate sequence is represented only by the first row in the sequence, and, if at all possible, also to create a new variable for this row containing the original number of rows
Here’s a short example.
| | Start scan | Sum. |
|------:|-------------:|------------:|
| 38191 | 247 | 7.06487e+06 |
| 38192 | 248 | 8.64094e+06 |
| 38193 | 249 | 8.82956e+06 |
| 38194 | 250 | 1.20744e+07 |
| 38195 | 251 | 1.28402e+07 |
| 38216 | 272 | 1.85569e+07 |
| 38217 | 273 | 2.83452e+07 |
| 38218 | 274 | 2.6532e+07 |
| 38219 | 275 | 2.2365e+07 |
| 38220 | 276 | 9.0689e+06 |
| 38230 | 286 | 6.92365e+06 |
| 38231 | 287 | 1.0727e+07 |
| 38232 | 288 | 7.81559e+06 |
| 38247 | 303 | 8.14354e+06 |
| 38381 | 437 | 8.4836e+06 |
| 38382 | 438 | 1.56635e+07 |
| 38383 | 439 | 1.24743e+07 |
| 38384 | 440 | 1.84585e+07 |
| 38385 | 441 | 1.82205e+07 |
| 38386 | 442 | 1.79627e+07 |
| 38387 | 443 | 8.56578e+06 |
| 38388 | 444 | 2.28438e+07 |
| 38389 | 445 | 8.19495e+06 |
| 38391 | 447 | 1.39808e+07 |
| 38392 | 448 | 8.30781e+06 |
I would like this converted to something like:
| | Start scan | Sum. | scans summarised |
|------:|-------------:|------------:|-----------------:|
| 38191 | 247 | 7.06487e+06 | 5|
| 38216 | 272 | 1.85569e+07 | 5|
| 38230 | 286 | 6.92365e+06 | 3|
| 38247 | 303 | 8.14354e+06 | 1|
| 38381 | 437 | 8.4836e+06 | 9|
| 38391 | 447 | 1.39808e+07 | 2|
I’ve tried my best but can’t find a way to do this.
Very grateful for any help or pointers.
Answers:
Solution by title:
Sum n consecutive entries in a sequence
You can aggregate by consecutive values created by compare difference by Series.diff
with 1
and cumulative sum by Series.cumsum
, then aggregate columns in GroupBy.agg
by GroupBy.first
, GroupBy.sum
and DataFrameGroupBy.size
:
d = {'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
out = df.groupby(df['Start scan'].diff().ne(1).cumsum()).agg(**d).reset_index(drop=True)
print (out)
Start scan Sum. scans summarised
0 247 49449970.0 5
1 272 104868000.0 5
2 286 25466240.0 3
3 303 8143540.0 1
4 437 130867630.0 9
5 447 22288610.0 2
If original indices are important:
d = {'i':('index','first'),
'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
.agg(**d).set_index('i').rename_axis(None))
print (out)
Start scan Sum. scans summarised
38191 247 49449970.0 5
38216 272 104868000.0 5
38230 286 25466240.0 3
38247 303 8143540.0 1
38381 437 130867630.0 9
38391 447 22288610.0 2
Solution by ouput data:
d = {'i':('index','first'),
'Start scan':('Start scan','first'),
'Sum.':('Sum.','first'),
'scans summarised':('Sum.','size')}
df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
.agg(**d).set_index('i').rename_axis(None))
print (out)
Create groups by counting the difference between Start Scan
and apply cumulative sum. After grouping, create a dictionary of function to aggregate values. The code preserve your index to get expected output:
out = (df.reset_index().assign(group=lambda x: x['Start scan'].diff().gt(1).cumsum())
.groupby('group')
.agg(**{'index': ('index', 'first'),
'Start scan': ('Start scan', 'first'),
'Sum.': ('Sum.', 'first'),
'scans summarised': ('Start scan', 'count')})
.set_index('index').rename_axis(None))
print(out)
# Output
Start scan Sum. scans summarised
38191 247 7064870.0 5
38216 272 18556900.0 5
38230 286 6923650.0 3
38247 303 8143540.0 1
38381 437 8483600.0 9
38391 447 13980800.0 2
Note: in your expected output, you take the first value for your the Sum.
column. I think you want to sum on it, so replace ('Sum.', 'first')
by ('Sum.', 'sum')
if needed.
apologies if the terminology isn’t right, I’m not quite sure how to ask the question accurately.
I would like to summarise my data, which for the purposes of a MWE consists of the time (in seconds) since starting the measurement (called ‘Start scan’ here), and then the total value (called ‘Sum.’ here). I have filtered the data down to those measurements above a certain cutoff. As the measurements rise and fall with a delay, each episode above the cutoff results in a few rows of data. I would like to reduce this down, so that each separate sequence is represented only by the first row in the sequence, and, if at all possible, also to create a new variable for this row containing the original number of rows
Here’s a short example.
| | Start scan | Sum. |
|------:|-------------:|------------:|
| 38191 | 247 | 7.06487e+06 |
| 38192 | 248 | 8.64094e+06 |
| 38193 | 249 | 8.82956e+06 |
| 38194 | 250 | 1.20744e+07 |
| 38195 | 251 | 1.28402e+07 |
| 38216 | 272 | 1.85569e+07 |
| 38217 | 273 | 2.83452e+07 |
| 38218 | 274 | 2.6532e+07 |
| 38219 | 275 | 2.2365e+07 |
| 38220 | 276 | 9.0689e+06 |
| 38230 | 286 | 6.92365e+06 |
| 38231 | 287 | 1.0727e+07 |
| 38232 | 288 | 7.81559e+06 |
| 38247 | 303 | 8.14354e+06 |
| 38381 | 437 | 8.4836e+06 |
| 38382 | 438 | 1.56635e+07 |
| 38383 | 439 | 1.24743e+07 |
| 38384 | 440 | 1.84585e+07 |
| 38385 | 441 | 1.82205e+07 |
| 38386 | 442 | 1.79627e+07 |
| 38387 | 443 | 8.56578e+06 |
| 38388 | 444 | 2.28438e+07 |
| 38389 | 445 | 8.19495e+06 |
| 38391 | 447 | 1.39808e+07 |
| 38392 | 448 | 8.30781e+06 |
I would like this converted to something like:
| | Start scan | Sum. | scans summarised |
|------:|-------------:|------------:|-----------------:|
| 38191 | 247 | 7.06487e+06 | 5|
| 38216 | 272 | 1.85569e+07 | 5|
| 38230 | 286 | 6.92365e+06 | 3|
| 38247 | 303 | 8.14354e+06 | 1|
| 38381 | 437 | 8.4836e+06 | 9|
| 38391 | 447 | 1.39808e+07 | 2|
I’ve tried my best but can’t find a way to do this.
Very grateful for any help or pointers.
Solution by title:
Sum n consecutive entries in a sequence
You can aggregate by consecutive values created by compare difference by Series.diff
with 1
and cumulative sum by Series.cumsum
, then aggregate columns in GroupBy.agg
by GroupBy.first
, GroupBy.sum
and DataFrameGroupBy.size
:
d = {'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
out = df.groupby(df['Start scan'].diff().ne(1).cumsum()).agg(**d).reset_index(drop=True)
print (out)
Start scan Sum. scans summarised
0 247 49449970.0 5
1 272 104868000.0 5
2 286 25466240.0 3
3 303 8143540.0 1
4 437 130867630.0 9
5 447 22288610.0 2
If original indices are important:
d = {'i':('index','first'),
'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
.agg(**d).set_index('i').rename_axis(None))
print (out)
Start scan Sum. scans summarised
38191 247 49449970.0 5
38216 272 104868000.0 5
38230 286 25466240.0 3
38247 303 8143540.0 1
38381 437 130867630.0 9
38391 447 22288610.0 2
Solution by ouput data:
d = {'i':('index','first'),
'Start scan':('Start scan','first'),
'Sum.':('Sum.','first'),
'scans summarised':('Sum.','size')}
df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
.agg(**d).set_index('i').rename_axis(None))
print (out)
Create groups by counting the difference between Start Scan
and apply cumulative sum. After grouping, create a dictionary of function to aggregate values. The code preserve your index to get expected output:
out = (df.reset_index().assign(group=lambda x: x['Start scan'].diff().gt(1).cumsum())
.groupby('group')
.agg(**{'index': ('index', 'first'),
'Start scan': ('Start scan', 'first'),
'Sum.': ('Sum.', 'first'),
'scans summarised': ('Start scan', 'count')})
.set_index('index').rename_axis(None))
print(out)
# Output
Start scan Sum. scans summarised
38191 247 7064870.0 5
38216 272 18556900.0 5
38230 286 6923650.0 3
38247 303 8143540.0 1
38381 437 8483600.0 9
38391 447 13980800.0 2
Note: in your expected output, you take the first value for your the Sum.
column. I think you want to sum on it, so replace ('Sum.', 'first')
by ('Sum.', 'sum')
if needed.