Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column

Question

I have time-series data sampled every 10 seconds. The values in one column first stay in the range 100 to 160 for around 2 minutes; some time later another run of values begins, ranging from 100 to 300+. I want to label the values in the first run as Y and the values in the second run as N.
The problem is that the second range (100-300+) also includes values from the first range (100-160). Below is a snapshot of the data with the required Label column:

TimeStamp   Data    Label
2022-09-20 14:57:28.900 13.656  
2022-09-20 14:57:38.900 21.1306 
2022-09-20 14:58:39.200 75.877  
2022-09-20 14:58:49.200 85.3981 
2022-09-20 14:58:59.200 98.7678 
2022-09-20 14:59:09.300 107.11  Y
2022-09-20 14:59:19.300 125.618 Y
2022-09-20 14:59:29.400 126.108 Y
2022-09-20 14:59:39.400 124.506 Y
2022-09-20 14:59:49.400 124.172 Y
2022-09-20 14:59:59.500 124.528 Y
2022-09-20 15:00:09.500 121.191 Y
2022-09-20 15:00:19.500 113.049 Y
2022-09-20 15:00:29.500 91.2932 
2022-09-20 15:00:39.600 76.8781 
2022-09-20 15:00:49.600 55.4778 
2022-09-20 15:00:59.600 41.0849 
2022-09-20 15:02:09.800 8.02791 
2022-09-20 15:03:00.000 27.2703 
2022-09-20 15:03:10.000 36.658  
2022-09-20 15:04:10.100 83.0846 
2022-09-20 15:04:20.100 101.4913    N
2022-09-20 15:05:40.400 152.869 N
2022-09-20 15:05:50.400 161.967 N
2022-09-20 15:06:00.400 166.862 N
2022-09-20 15:08:40.900 294.93  N
2022-09-20 15:08:50.900 280.092 N
2022-09-20 15:09:00.900 261.405 N
2022-09-20 15:09:11.000 237.291 N
2022-09-20 15:09:21.000 219.584 N
2022-09-20 15:09:31.000 191.888 N
2022-09-20 15:09:41.100 172.979 N
2022-09-20 15:09:51.100 144.505 N
2022-09-20 15:10:01.100 125.596 N
2022-09-20 15:10:11.100 102.883 N
2022-09-20 15:11:11.300 19.6846 
2022-09-20 15:11:21.400 17.816  
2022-09-20 15:11:31.400 27.8932 
2022-09-20 15:11:41.400 23.1549 
2022-09-20 15:11:51.400 14.4569

Any help please?

Asked By: Pardeep Kumar

Answer 1

A lambda function applied to a series does the trick here.

df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))

Here my_condition is a function you define yourself, and my_datetime is a cutoff timestamp you choose. It might look something like:

def my_condition(x):
    if x <= my_datetime:
        return "Y"
    return "N"

Or you can do it all in one line:

df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")
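
As a minimal sketch of the cutoff idea: assuming the TimeStamp column has already been parsed to datetimes, and using a hand-picked cutoff of 15:02:00 for my_datetime (an assumption for illustration, not a value from the question), the same labelling can be written without apply by using numpy.where. The four rows are taken from the question's data.

```python
import numpy as np
import pandas as pd

# A few rows from the question's data, with TimeStamp parsed to datetimes.
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2022-09-20 14:59:09.300",
        "2022-09-20 15:00:19.500",
        "2022-09-20 15:04:20.100",
        "2022-09-20 15:09:51.100",
    ]),
    "Data": [107.11, 113.049, 101.4913, 144.505],
})

# Hypothetical cutoff: rows at or before it get "Y", later rows get "N".
my_datetime = pd.Timestamp("2022-09-20 15:02:00")

# Vectorized comparison instead of a per-row lambda.
df["new_column"] = np.where(df["TimeStamp"] <= my_datetime, "Y", "N")
print(df)
```

This only works when a single cutoff time separates the two runs and is known in advance.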

EDIT

After the question was edited, it’s clear a new approach is needed. We need to identify the first run of consecutive rows for which the Y condition is met (values between 100 and 160).

I’d approach by finding the first instance where the condition is met.

idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()

The way this works: the boolean conditions create a boolean mask, so every row is either True or False. idxmax() treats the booleans as integers (1 or 0) and returns the index label of the first occurrence of the maximum value (i.e. the first 1, which is the first True).
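
To make the mask-plus-idxmax step concrete, here is a small sketch on five values taken from the question's Data column. The names data, in_range and first_hit are illustrative only.

```python
import pandas as pd

# A few values from the question's Data column (default integer index 0..4).
data = pd.Series([85.3981, 98.7678, 107.11, 125.618, 91.2932])

# Boolean mask: True where the value lies in [100, 160].
in_range = (data >= 100) & (data <= 160)

# idxmax returns the index label of the first True in the mask.
first_hit = in_range.idxmax()

print(in_range.tolist())  # [False, False, True, True, False]
print(first_hit)          # 2
```

One caveat: if no row satisfies the condition, the mask is all False and idxmax still returns the first index label, so it is worth checking in_range.any() before trusting the result.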

Next, find the first instance where the condition is no longer met. In this case, that is the first row whose value is outside the range 100-160 while the previous row's value IS in the range.

idx_end = (((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))).idxmax()

The boolean mask approach is identical, but the shift function is used to identify the previous row, to apply a condition to it.

Once you have the start and end indices, it’s trivial to mark the rows from idx_start up to (but not including) idx_end as Y.
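
Putting the steps together, a minimal sketch might look like the following. It assumes a default integer RangeIndex (so idx_end - 1 is the last row of the run), uses Series.between in place of the two comparisons, and only labels the first run as Y. Marking the second run as N is not covered here. The sample values come from the question's Data column.

```python
import pandas as pd

# Sample values from the question's Data column, default RangeIndex assumed.
df = pd.DataFrame({"Data": [85.3981, 98.7678, 107.11, 125.618, 113.049,
                            91.2932, 41.0849, 101.4913, 161.967, 144.505]})

# True where the value lies in [100, 160] (between is inclusive by default).
in_range = df["Data"].between(100, 160)
# The same mask moved down one row, i.e. "was the previous row in range?".
prev_in_range = in_range.shift(1, fill_value=False)

idx_start = in_range.idxmax()                   # first row inside the range
idx_end = (~in_range & prev_in_range).idxmax()  # first row after the run ends

df["Label"] = ""
# .loc slicing includes both endpoints, so stop one row before idx_end.
df.loc[idx_start:idx_end - 1, "Label"] = "Y"
print(df)
```

With a non-integer index (for example a DatetimeIndex), positional slicing via df.index.get_loc and iloc would be needed instead of idx_end - 1.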

Answered By: Alan

Answer 2

A variant of the solution: split the matching rows into groups wherever the index jumps.

2.txt

TimeStamp   Data    Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11
2022-09-20 14:59:19.300 125.618
2022-09-20 14:59:29.400 126.108
2022-09-20 14:59:39.400 124.506
2022-09-20 14:59:49.400 124.172
2022-09-20 14:59:59.500 124.528
2022-09-20 15:00:09.500 121.191
2022-09-20 15:00:19.500 113.049
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913
2022-09-20 15:05:40.400 152.869
2022-09-20 15:05:50.400 161.967
2022-09-20 15:06:00.400 166.862
2022-09-20 15:08:40.900 294.930
2022-09-20 15:08:50.900 280.092
2022-09-20 15:09:00.900 261.405
2022-09-20 15:09:11.000 237.291
2022-09-20 15:09:21.000 219.584
2022-09-20 15:09:31.000 191.888
2022-09-20 15:09:41.100 172.979
2022-09-20 15:09:51.100 144.505
2022-09-20 15:10:01.100 125.596
2022-09-20 15:10:11.100 102.883
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.8160
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569

main.py

import pandas as pd
import numpy as np
from pprint import pprint


# The header names three columns, but each row splits into date, time and value,
# so 'TimeStamp' holds the date, 'Data' the time and 'Label' the numeric value.
df = pd.read_csv('2.txt', sep=r'\s+')
# Mark all rows whose value is at least 100 with 'Y'.
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# Select the values of the marked rows as a pandas.Series.
s = df[df['mark'] == 'Y']['Label']
# Start a new group wherever the index does not increase by exactly 1.
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# Relabel every group whose maximum value is > 160 as 'N'.
for name, group in grouped:
    if group.max() > 160:
        df.loc[group.index, 'mark'] = 'N'
pprint(df)

------------------------------

     TimeStamp          Data      Label mark
0   2022-09-20  14:57:28.900   13.65600     
1   2022-09-20  14:57:38.900   21.13060     
2   2022-09-20  14:58:39.200   75.87700     
3   2022-09-20  14:58:49.200   85.39810     
4   2022-09-20  14:58:59.200   98.76780     
5   2022-09-20  14:59:09.300  107.11000    Y
6   2022-09-20  14:59:19.300  125.61800    Y
7   2022-09-20  14:59:29.400  126.10800    Y
8   2022-09-20  14:59:39.400  124.50600    Y
9   2022-09-20  14:59:49.400  124.17200    Y
10  2022-09-20  14:59:59.500  124.52800    Y
11  2022-09-20  15:00:09.500  121.19100    Y
12  2022-09-20  15:00:19.500  113.04900    Y
13  2022-09-20  15:00:29.500   91.29320     
14  2022-09-20  15:00:39.600   76.87810     
15  2022-09-20  15:00:49.600   55.47780     
16  2022-09-20  15:00:59.600   41.08490     
17  2022-09-20  15:02:09.800    8.02791     
18  2022-09-20  15:03:00.000   27.27030     
19  2022-09-20  15:03:10.000   36.65800     
20  2022-09-20  15:04:10.100   83.08460     
21  2022-09-20  15:04:20.100  101.49130    N
22  2022-09-20  15:05:40.400  152.86900    N
23  2022-09-20  15:05:50.400  161.96700    N
24  2022-09-20  15:06:00.400  166.86200    N
25  2022-09-20  15:08:40.900  294.93000    N
26  2022-09-20  15:08:50.900  280.09200    N
27  2022-09-20  15:09:00.900  261.40500    N
28  2022-09-20  15:09:11.000  237.29100    N
29  2022-09-20  15:09:21.000  219.58400    N
30  2022-09-20  15:09:31.000  191.88800    N
31  2022-09-20  15:09:41.100  172.97900    N
32  2022-09-20  15:09:51.100  144.50500    N
33  2022-09-20  15:10:01.100  125.59600    N
34  2022-09-20  15:10:11.100  102.88300    N
35  2022-09-20  15:11:11.300   19.68460     
36  2022-09-20  15:11:21.400   17.81600     
37  2022-09-20  15:11:31.400   27.89320     
38  2022-09-20  15:11:41.400   23.15490     
39  2022-09-20  15:11:51.400   14.45690
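
A related sketch, not the code above: instead of looking for jumps in the index of the filtered rows, it numbers runs by detecting where the "value >= 100" mask changes, then broadcasts each run's maximum back to its rows with groupby(...).transform("max"). The names above, run_id, run_max and label are illustrative, and the values are a shortened sample from the question's data.

```python
import numpy as np
import pandas as pd

# Shortened sample of the question's values.
data = pd.Series([98.7678, 107.11, 125.618, 113.049, 91.2932,
                  83.0846, 101.4913, 161.967, 294.93, 125.596, 19.6846])

above = data >= 100
# A new run starts wherever `above` differs from the previous row;
# the cumulative sum of those change points numbers the runs.
run_id = above.ne(above.shift()).cumsum()
# Maximum of each run, repeated on every row of that run.
run_max = data.groupby(run_id).transform("max")

# Rows below 100 stay blank; runs peaking above 160 get "N", the rest "Y".
label = np.where(~above, "", np.where(run_max > 160, "N", "Y"))
print(label.tolist())
```

This avoids the explicit loop, at the cost of computing a maximum for the below-100 runs as well, which are then ignored.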

Answered By: Сергей Кох
