Pandas choose the latest datetime within an interval of 5 minutes

Question:

I am trying to choose the latest time value for a measurement type within an interval of 5 minutes.

import pandas as pd

data = [
        ["2017-01-03T10:04:45", "A", "35.79"],
        ["2017-01-03T10:01:18", "B", "98.78"],
        ["2017-01-03T10:09:07", "A", "35.01"],
        ["2017-01-03T10:03:34", "B", "96.49"],
        ["2017-01-03T10:02:01", "A", "35.82"],
        ["2017-01-03T10:05:00", "B", "97.17"],
        ["2017-01-03T10:05:01", "B", "95.08"]
       ]

df = pd.DataFrame(data, columns=["timestamp", "measurement_type", "measurement_value"])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['measurement_value'] = df['measurement_value'].astype(float)

My DataFrame df:

timestamp measurement_type measurement_value
2017-01-03 10:04:45 A 35.79
2017-01-03 10:01:18 B 98.78
2017-01-03 10:09:07 A 35.01
2017-01-03 10:03:34 B 96.49
2017-01-03 10:02:01 A 35.82
2017-01-03 10:05:00 B 97.17
2017-01-03 10:05:01 B 95.08

The desired output is:

timestamp measurement_type measurement_value
2017-01-03 10:05:00 A 35.79
2017-01-03 10:10:00 A 35.01
2017-01-03 10:05:00 B 97.17
2017-01-03 10:10:00 B 95.08

So each timestamp should be rounded up to the end of the 5-minute interval it falls within. If a timestamp is exactly on an interval boundary, it should count towards the interval that ends at that boundary. For example:
| 2017-01-03 10:05:00 | B | 97.17 |
belongs to 10:05:00 and not 10:10:00.

I found this:

df.groupby(["measurement_type", pd.Grouper(key="timestamp", freq="5min", offset="1sec")])["timestamp"].max()

which outputs:

measurement_type  timestamp          
A                 2017-01-03 10:00:01   2017-01-03 10:04:45
                  2017-01-03 10:05:01   2017-01-03 10:09:07
B                 2017-01-03 10:00:01   2017-01-03 10:05:00
                  2017-01-03 10:05:01   2017-01-03 10:05:01
Name: timestamp, dtype: datetime64[ns]

But I have the following problems with this:

  1. The datetime is rounded down, but I need it rounded up. (I worked around this by adding 5 minutes to each datetime, but maybe there is a better option?)
  2. I used offset="1sec" so that 10:05:00 is assigned to the 10:00:00 interval and not the 10:05:00 one. (This is correct here because the timestamps are being rounded down, whereas I actually need them rounded up.) I think I will just subtract one second from the timestamp again to get rid of the trailing 01.
  3. The output is a Series, so the measurement_value is lost. Is it possible to keep the measurement_value column with the corresponding value for the chosen timestamp, i.e. to get a DataFrame like the desired output?

I hope I formulated my question in an understandable way.

Asked By: Samat

||

Answers:

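Create a 5min frequency Grouper with closed='right' and label='right', sort the DataFrame by timestamp, then group by measurement_type and the grouper and aggregate with last to select the row with the latest timestamp in each 5-minute interval. With closed='right', a timestamp exactly on a boundary (such as 10:05:00) falls into the interval that ends there, and label='right' labels each group with the end of its interval instead of the start.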

g = pd.Grouper(key='timestamp', freq='5min', label='right', closed='right')
df.sort_values('timestamp').groupby(['measurement_type', g]).last().reset_index()

Result

  measurement_type           timestamp  measurement_value
0                A 2017-01-03 10:05:00              35.79
1                A 2017-01-03 10:10:00              35.01
2                B 2017-01-03 10:05:00              97.17
3                B 2017-01-03 10:10:00              95.08
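
To see why closed='right' and label='right' matter, the following minimal sketch compares them with the defaults for a 5min frequency on three of the "B" rows from the question. The frame name ts and the result names left and right are made up for this illustration.

```python
import pandas as pd

# Three of the "B" rows from the question, including the one exactly on a boundary.
ts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-01-03T10:03:34",
        "2017-01-03T10:05:00",
        "2017-01-03T10:05:01",
    ]),
    "measurement_value": [96.49, 97.17, 95.08],
})

# Defaults for a 5min frequency: bins are [10:00, 10:05), labeled by the left edge.
left = ts.groupby(
    pd.Grouper(key="timestamp", freq="5min")
)["measurement_value"].last()

# closed='right', label='right': bins are (10:00, 10:05], labeled by the right edge.
right = ts.groupby(
    pd.Grouper(key="timestamp", freq="5min", closed="right", label="right")
)["measurement_value"].last()

print(left)
print(right)
```

With the defaults, 10:05:00 lands in the bin labeled 10:05:00 together with 10:05:01, so the later reading wins. With closed='right' and label='right', 10:05:00 is the last reading of the bin labeled 10:05:00 and 10:05:01 moves to the bin labeled 10:10:00, which matches the rule in the question.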
Answered By: Shubham Sharma
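
A possible alternative, not taken from the answer above, is to round each timestamp up explicitly with Series.dt.ceil and group on the rounded value. This is only a sketch: the column name interval_end is an assumption introduced here, and the original timestamp is kept alongside it.

```python
import pandas as pd

data = [
    ["2017-01-03T10:04:45", "A", 35.79],
    ["2017-01-03T10:01:18", "B", 98.78],
    ["2017-01-03T10:09:07", "A", 35.01],
    ["2017-01-03T10:03:34", "B", 96.49],
    ["2017-01-03T10:02:01", "A", 35.82],
    ["2017-01-03T10:05:00", "B", 97.17],
    ["2017-01-03T10:05:01", "B", 95.08],
]
df = pd.DataFrame(data, columns=["timestamp", "measurement_type", "measurement_value"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Round every timestamp up to the next 5-minute mark.
# Timestamps already on a mark (10:05:00) are left unchanged.
df["interval_end"] = df["timestamp"].dt.ceil("5min")

# Sort by time so that last() picks the latest row of each group.
result = (
    df.sort_values("timestamp")
      .groupby(["measurement_type", "interval_end"], as_index=False)
      .last()
)
print(result)
```

Here result has one row per measurement_type and interval_end, with timestamp holding the original time of the selected reading, which can be handy for checking which row was chosen.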