How to only return consecutive values in a pandas dataframe
Question:
I have filtered a dataframe to only include rows whose value is over a certain threshold. Now I want to filter this new dataframe further to only include the rows that are consecutive. For example, if this is the dataframe, with the index followed by the Timestamp and Value columns:
Timestamp Value
316 2022-03-01T16:28:50.4580078Z 36.000
344 2022-03-01T17:26:25.8170166Z 37.500
345 2022-03-01T17:26:45.8480072Z 41.000
346 2022-03-01T17:27:05.8800048Z 35.250
374 2022-03-01T18:23:10.9630126Z 36.000
387 2022-03-01T18:36:12.1380004Z 42.000
388 2022-03-01T18:36:32.1690063Z 37.000
391 2022-03-01T18:37:52.2950134Z 34.500
This should return a new dataframe with only rows 344, 345, 346, 387, and 388, because these are the only rows that have a consecutive neighbor in the index:
Timestamp Value
344 2022-03-01T17:26:25.8170166Z 37.500
345 2022-03-01T17:26:45.8480072Z 41.000
346 2022-03-01T17:27:05.8800048Z 35.250
387 2022-03-01T18:36:12.1380004Z 42.000
388 2022-03-01T18:36:32.1690063Z 37.000
I have tried iterating and appending to a new list, using df.iloc[i] and df.iloc[i-1], and still can't seem to get it working properly.
Answers:
based on the time
You can compute the successive time differences and compare them to a threshold (here 1 minute). Keep the rows for which the mask is True, either for the row itself or for the row below:
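import pandas as pd

thresh = pd.Timedelta('1min')
# True where the gap to the previous row is below the threshold
m = (pd.to_datetime(df['Timestamp'])
       .diff().lt(thresh)
     )
# keep a row if it is close to the previous row or to the next one
out = df[m | m.shift(-1)]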
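For anyone who wants to try the time-based approach end to end, here is a minimal sketch that rebuilds the sample dataframe from the question. The names `close_to_prev` and `close_to_next` are only illustrative, and `fill_value=False` is my own addition so the shifted mask stays boolean instead of ending in NaN.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "Timestamp": [
            "2022-03-01T16:28:50.4580078Z",
            "2022-03-01T17:26:25.8170166Z",
            "2022-03-01T17:26:45.8480072Z",
            "2022-03-01T17:27:05.8800048Z",
            "2022-03-01T18:23:10.9630126Z",
            "2022-03-01T18:36:12.1380004Z",
            "2022-03-01T18:36:32.1690063Z",
            "2022-03-01T18:37:52.2950134Z",
        ],
        "Value": [36.0, 37.5, 41.0, 35.25, 36.0, 42.0, 37.0, 34.5],
    },
    index=[316, 344, 345, 346, 374, 387, 388, 391],
)

ts = pd.to_datetime(df["Timestamp"])

# True where the gap to the previous row is under one minute
close_to_prev = ts.diff().lt(pd.Timedelta("1min"))
# Same mask moved up one row: True where the next row is under one minute away
close_to_next = close_to_prev.shift(-1, fill_value=False)

out = df[close_to_prev | close_to_next]
print(out.index.tolist())
```

Note that row 391 is dropped because it comes 80 seconds after row 388, which is above the 1 minute threshold.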
based on the index
Same logic but checking a successive difference of 1 in the index.
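# True where the index is exactly 1 more than the previous row's index
m = df.index.to_series().diff().eq(1)
# keep a row if it follows its predecessor or precedes its successor
out = df[m | m.shift(-1)]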
Output:
Timestamp Value
344 2022-03-01 17:26:25.817016600+00:00 37.50
345 2022-03-01 17:26:45.848007200+00:00 41.00
346 2022-03-01 17:27:05.880004800+00:00 35.25
387 2022-03-01 18:36:12.138000400+00:00 42.00
388 2022-03-01 18:36:32.169006300+00:00 37.00
Intermediates:
Timestamp Value m m.shift(-1)
316 2022-03-01 16:28:50.458007800+00:00 36.00 False False
344 2022-03-01 17:26:25.817016600+00:00 37.50 False True
345 2022-03-01 17:26:45.848007200+00:00 41.00 True True
346 2022-03-01 17:27:05.880004800+00:00 35.25 True False
374 2022-03-01 18:23:10.963012600+00:00 36.00 False False
387 2022-03-01 18:36:12.138000400+00:00 42.00 False True
388 2022-03-01 18:36:32.169006300+00:00 37.00 True False
391 2022-03-01 18:37:52.295013400+00:00 34.50 False NaN
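If you need this in more than one place, the index-based logic can be wrapped in a small helper. This is only a sketch: `keep_consecutive` is a hypothetical name, it assumes a unique integer index, and it uses `diff(-1)` for the look-ahead comparison rather than shifting the first mask, which should give the same rows here.

```python
import pandas as pd


def keep_consecutive(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose index is adjacent (+/- 1) to a neighboring row's index.

    Hypothetical helper; assumes a unique integer index.
    """
    idx = df.index.to_series()
    # True where this index is exactly 1 more than the previous index
    follows_prev = idx.diff().eq(1)
    # diff(-1) is current minus next, so -1 means the next index is 1 higher
    precedes_next = idx.diff(-1).eq(-1)
    return df[follows_prev | precedes_next]


# Sample values and index copied from the question
df = pd.DataFrame(
    {"Value": [36.0, 37.5, 41.0, 35.25, 36.0, 42.0, 37.0, 34.5]},
    index=[316, 344, 345, 346, 374, 387, 388, 391],
)

out = keep_consecutive(df)
print(out)
```

The first and last rows compare against NaN, and `eq` treats that as False, so no extra handling is needed at the edges.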