Remove timestamps that are close to each other or within threshold in a dataframe

Question:

Problem:
I have a lot of datapoints that are too close to each other in terms of timestamps meaning that when I manipulate the data, I am oversampling.

What I need:
For each element in the dataframe, I need to drop all elements that are within a 5 second range from it. This means that the first row is the one I am using and the subsequent rows that are within 5 seconds need to be dropped. This is what I have right now:

test_df = test_df.sort_values(by='timestamp').reset_index(drop=True)
for i in range(0, len(test_df)):
    currentTimestamp = test_df.timestamp.loc[i]
    temp_row = test_df.loc[i]
    temp_5_secs = add_ms(currentTimestamp, 5000)
    test_df = test_df[(test_df.timestamp) >= temp_5_secs].reset_index(drop=True)
    test_df = test_df.append(temp_row, ignore_index=True)
    test_df = test_df.sort_values(by='timestamp').reset_index(drop=True)


def add_ms(timestamp, ms):
    return int(timestamp) + 1000000* ms

but its giving me the following error:

ValueError Traceback (most recent call last)
/local/home/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
354 try:
–> 355 return self._range.index(new_key)
356 except ValueError as err:

ValueError: 235 is not in range

This is a sample dataframe:

index timestamp        value
0   1675434673166538783 14.62
1   1675434684419325010 14.62
2   1675434684420505291 14.62
3   1675434673164456809 14.63
4   1675434700772654637 14.50

Summary, for each of the timestamp elements in the dataframe, remove the subsequent elements that are within 5 second range from that element. Remove by dropping the rows. Do this for each element so we know for a fact that the result will be that there will be no rows that timestamps are less than 5 seconds from each other.

Thanks in advance!

Asked By: Bag

||

Answers:

To remove timestamps that are close to each other or within a threshold in a dataframe, you can use the pandas library in Python.

Here’s an example code that removes timestamps that are less than or equal to 10 seconds apart from each other:

# create a sample dataframe
df = pd.DataFrame({'timestamp': ['2023-02-23 10:00:00', '2023-02-23 10:00:05', '2023-02-23 10:00:10', '2023-02-23 10:00:20', '2023-02-23 10:00:25']})

# convert timestamp column to datetime type
df['timestamp'] = pd.to_datetime(df['timestamp'])

# calculate time difference between consecutive rows
time_diff = df['timestamp'].diff().fillna(pd.Timedelta(seconds=0))

# create a mask of rows to keep
mask = (time_diff > pd.Timedelta(seconds=10)) | (time_diff == pd.Timedelta(seconds=0))

# filter dataframe using the mask
df_filtered = df[mask]

# print the filtered dataframe
print(df_filtered)

Output:

            timestamp
0 2023-02-23 10:00:00
3 2023-02-23 10:00:20
4 2023-02-23 10:00:25

In this example, we first create a sample dataframe with a timestamp column. We then convert this column to datetime type using the pd.to_datetime() function.

Next, we calculate the time difference between consecutive rows using the diff() function, which computes the difference between the current and previous row. We use fillna(pd.Timedelta(seconds=0)) to replace the first row’s NaN value with a Timedelta object representing zero seconds.

We then create a mask of rows to keep, where time_diff > pd.Timedelta(seconds=10) evaluates to True for rows where the time difference between consecutive rows is greater than 10 seconds, and time_diff == pd.Timedelta(seconds=0) evaluates to True for the first row. We use the | operator to combine these conditions with an OR operator.

Finally, we use the mask to filter the original dataframe and create a new dataframe called df_filtered. This new dataframe contains only the rows with timestamps that are more than 10 seconds apart or the first row.

Answered By: Leandro Paiva