How to get start and end datetime indices of groups of consecutive values in pandas, including repeated values?
Question:
There are many answers based on numerical indices, but I am looking for a solution that works with a DatetimeIndex and got really stuck here. The closest answer I found with a numerical index is this one, but it does not work for my example.
I want to get the group start and end as datetimes for groups of n consecutive equal values in a DataFrame column.
Sample data:
import pandas as pd
index = pd.date_range(
    start=pd.Timestamp("2023-03-20 12:00:00+0000", tz="UTC"),
    end=pd.Timestamp("2023-03-20 15:00:00+0000", tz="UTC"),
    freq="15Min",
)
data = {
    "values_including_constant_groups": [
        2.0,
        1.0,
        1.0,
        3.0,
        3.0,
        3.0,
        4.0,
        4.0,
        4.0,
        2.0,
        3.0,
        3.0,
        1.0,
    ],
}
df = pd.DataFrame(
    index=index,
    data=data,
)
print(df)
yields:
values_including_constant_groups
2023-03-20 12:00:00+00:00 2.0
2023-03-20 12:15:00+00:00 1.0
2023-03-20 12:30:00+00:00 1.0
2023-03-20 12:45:00+00:00 3.0
2023-03-20 13:00:00+00:00 3.0
2023-03-20 13:15:00+00:00 3.0
2023-03-20 13:30:00+00:00 4.0
2023-03-20 13:45:00+00:00 4.0
2023-03-20 14:00:00+00:00 4.0
2023-03-20 14:15:00+00:00 2.0
2023-03-20 14:30:00+00:00 3.0
2023-03-20 14:45:00+00:00 3.0
2023-03-20 15:00:00+00:00 1.0
Desired output (I would be flexible here but this would be my first idea):
values_including_constant_groups group_start group_end
2023-03-20 12:00:00+00:00 2.0 NaN NaN
2023-03-20 12:15:00+00:00 1.0 True False
2023-03-20 12:30:00+00:00 1.0 False True
2023-03-20 12:45:00+00:00 3.0 True False
2023-03-20 13:00:00+00:00 3.0 False False
2023-03-20 13:15:00+00:00 3.0 False True
2023-03-20 13:30:00+00:00 4.0 True False
2023-03-20 13:45:00+00:00 4.0 False False
2023-03-20 14:00:00+00:00 4.0 False True
2023-03-20 14:15:00+00:00 2.0 NaN NaN
2023-03-20 14:30:00+00:00 3.0 True False
2023-03-20 14:45:00+00:00 3.0 False True
2023-03-20 15:00:00+00:00 1.0 NaN NaN
So only groups of n >= 2 should be considered here, and "single" values should be excluded. Moreover, a value that forms a group more than once (like 3.0 above) should be flagged each time it occurs.
Any hints are very welcome!
Answers:
Code
c = 'values_including_constant_groups'
# Compare each row with the previous row (group start) and with the
# next row (group end) to flag where a run of equal values begins and ends
s, e = df[c] != df[c].shift(), df[c] != df[c].shift(-1)
# Mask the flags where both group_start and group_end are True
# on the same row, i.e. where the run length is n == 1
df['group_start'], df['group_end'] = s.mask(s & e), e.mask(s & e)
Result
values_including_constant_groups group_start group_end
2023-03-20 12:00:00+00:00 2.0 NaN NaN
2023-03-20 12:15:00+00:00 1.0 True False
2023-03-20 12:30:00+00:00 1.0 False True
2023-03-20 12:45:00+00:00 3.0 True False
2023-03-20 13:00:00+00:00 3.0 False False
2023-03-20 13:15:00+00:00 3.0 False True
2023-03-20 13:30:00+00:00 4.0 True False
2023-03-20 13:45:00+00:00 4.0 False False
2023-03-20 14:00:00+00:00 4.0 False True
2023-03-20 14:15:00+00:00 2.0 NaN NaN
2023-03-20 14:30:00+00:00 3.0 True False
2023-03-20 14:45:00+00:00 3.0 False True
2023-03-20 15:00:00+00:00 1.0 NaN NaN
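Since the question asks for the group start and end as datetimes, here is a minimal sketch of one way to get a table with one row per group instead of boolean flags. It is not part of the answer above: the helper name `consecutive_runs`, the `min_length` parameter and the output column names are my own choices for illustration. It labels each run with a cumulative sum over the "value changed" flags and then aggregates the index per run. Note that NaN values compare as unequal to each other, so each NaN ends up as its own run of length 1.

```python
import pandas as pd


def consecutive_runs(series: pd.Series, min_length: int = 2) -> pd.DataFrame:
    """Return one row per run of equal consecutive values (hypothetical helper)."""
    # A new run starts wherever the value differs from the previous row;
    # the cumulative sum of those flags gives every run its own integer id.
    run_id = (series != series.shift()).cumsum()
    # Put the index into a regular column so it can be aggregated per run.
    frame = pd.DataFrame(
        {
            "timestamp": series.index,
            "value": series.to_numpy(),
            "run_id": run_id.to_numpy(),
        }
    )
    runs = frame.groupby("run_id").agg(
        value=("value", "first"),
        group_start=("timestamp", "first"),
        group_end=("timestamp", "last"),
        length=("timestamp", "count"),
    )
    # Keep only the runs that are long enough, e.g. drop single values.
    return runs[runs["length"] >= min_length].reset_index(drop=True)


# Same sample data as in the question.
index = pd.date_range("2023-03-20 12:00", "2023-03-20 15:00", freq="15min", tz="UTC")
values = [2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 2.0, 3.0, 3.0, 1.0]
df = pd.DataFrame({"values_including_constant_groups": values}, index=index)

runs = consecutive_runs(df["values_including_constant_groups"])
print(runs)
```

For the sample data this should give four rows: 1.0 from 12:15 to 12:30, 3.0 from 12:45 to 13:15, 4.0 from 13:30 to 14:00, and the second 3.0 group from 14:30 to 14:45. Passing `min_length=3` would keep only the two groups of three.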