Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column
Question:
I have time-series data sampled every 10 seconds. One column’s values first stay in the range 100 to 160 for around 2 minutes; some time later, another run of values begins that ranges from 100 to 300+. I want to label the values of the first run as Y and the values of the second run as N.
The problem is that the second range (100-300+) also includes values from the first range (100-160). Below is a snapshot of the data with the required Label column:
TimeStamp Data Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11 Y
2022-09-20 14:59:19.300 125.618 Y
2022-09-20 14:59:29.400 126.108 Y
2022-09-20 14:59:39.400 124.506 Y
2022-09-20 14:59:49.400 124.172 Y
2022-09-20 14:59:59.500 124.528 Y
2022-09-20 15:00:09.500 121.191 Y
2022-09-20 15:00:19.500 113.049 Y
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913 N
2022-09-20 15:05:40.400 152.869 N
2022-09-20 15:05:50.400 161.967 N
2022-09-20 15:06:00.400 166.862 N
2022-09-20 15:08:40.900 294.93 N
2022-09-20 15:08:50.900 280.092 N
2022-09-20 15:09:00.900 261.405 N
2022-09-20 15:09:11.000 237.291 N
2022-09-20 15:09:21.000 219.584 N
2022-09-20 15:09:31.000 191.888 N
2022-09-20 15:09:41.100 172.979 N
2022-09-20 15:09:51.100 144.505 N
2022-09-20 15:10:01.100 125.596 N
2022-09-20 15:10:11.100 102.883 N
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.816
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
Any help please?
Answers:
A lambda function applied to a series does the trick here.
df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))
That my_condition function is one you can define. It might do something like:
def my_condition(x):
    if x <= my_datetime:
        return "Y"
    return "N"
Or you can do it all in one line:
df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")
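The same comparison can also be done on the whole column at once, without apply. The following is a minimal sketch using numpy.where. The cutoff value and the three sample rows are assumptions for illustration, since my_datetime is not defined in the answer above. Note that a pure timestamp cutoff labels every row, including those below 100.

```python
import numpy as np
import pandas as pd

# Three sample rows copied from the question's data.
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2022-09-20 14:59:09.300",
        "2022-09-20 15:00:19.500",
        "2022-09-20 15:04:20.100",
    ]),
    "Data": [107.11, 113.049, 101.4913],
})

# Hypothetical cutoff between the two runs (an assumption, pick your own).
my_datetime = pd.Timestamp("2022-09-20 15:02:00")

# Vectorized equivalent of the apply/lambda one-liner.
df["new_column"] = np.where(df["TimeStamp"] <= my_datetime, "Y", "N")
```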
EDIT
After the question was edited, it’s clear a new approach is needed. We need to be able to identify the first occurrence of a sequence of rows for which the condition for Y is met (between 100 and 160).
I’d approach this by finding the first instance where the condition is met.
idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()
This works because the two comparisons produce a boolean mask, so every row is either True or False. idxmax() treats the bools as ints (1 or 0) and returns the index label of the first occurrence of the max value (i.e. the first 1, which is the first True). If no row is in range, idxmax() still returns the first label, so check the mask with .any() first.
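To see the idxmax() behaviour in isolation, here is a small sketch on a made-up Series. The values are hypothetical and chosen only for illustration. Series.between is used as a shorthand for the two comparisons and is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the behaviour.
data = pd.Series([13.7, 98.8, 107.1, 125.6, 91.3, 152.9])

# Boolean mask: True where 100 <= value <= 160.
in_range = data.between(100, 160)

# Index label of the first True in the mask.
first_hit = in_range.idxmax()

# A mask with no True at all: guard with .any() before using idxmax().
none_in_range = pd.Series([1.0, 2.0]).between(100, 160)
has_hit = none_in_range.any()
```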
Next, find the first instance where the condition is no longer met. In this case, it’s the first instance where the value is not in the range 100-160, but the previous row IS in the range.
idx_end = (((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))).idxmax()
The boolean mask approach is identical, but the shift function is used to look at the previous row and apply a condition to it. As before, idxmax() reduces the mask to the index of its first True.
Once you have the start and end indices, it’s trivial to mark the rows in between as Y.
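The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the frame uses a shortened subset of the question's values with a default integer index, and the final step that labels later rows of 100 or more as N is an addition not spelled out in the answer.

```python
import pandas as pd

# Shortened subset of the question's values, default RangeIndex assumed.
df = pd.DataFrame({"Data": [
    13.656, 98.7678, 107.11, 125.618, 113.049, 91.2932, 27.2703,
    101.4913, 152.869, 166.862, 294.93, 102.883, 19.6846,
]})

in_range = df["Data"].between(100, 160)

# First row inside 100-160.
idx_start = in_range.idxmax()

# First row that is out of range while the previous row was in range.
left_range = ~in_range & in_range.shift(1, fill_value=False)
idx_end = left_range.idxmax()

df["Label"] = ""
# .loc slicing includes both endpoints, so stop one row before idx_end.
df.loc[idx_start:idx_end - 1, "Label"] = "Y"

# Assumption: every later row with a value of at least 100 belongs to the second run.
df.loc[(df.index >= idx_end) & (df["Data"] >= 100), "Label"] = "N"
```

This relies on the first in-range run being the Y run. If the data never leaves the range, left_range has no True and idx_end falls back to the first label, so that case needs a separate check.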
As a variant of the solution, you can split the rows of 100 and above into groups wherever the index jumps, then label each group by its maximum value.
2.txt
TimeStamp Data Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11
2022-09-20 14:59:19.300 125.618
2022-09-20 14:59:29.400 126.108
2022-09-20 14:59:39.400 124.506
2022-09-20 14:59:49.400 124.172
2022-09-20 14:59:59.500 124.528
2022-09-20 15:00:09.500 121.191
2022-09-20 15:00:19.500 113.049
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913
2022-09-20 15:05:40.400 152.869
2022-09-20 15:05:50.400 161.967
2022-09-20 15:06:00.400 166.862
2022-09-20 15:08:40.900 294.930
2022-09-20 15:08:50.900 280.092
2022-09-20 15:09:00.900 261.405
2022-09-20 15:09:11.000 237.291
2022-09-20 15:09:21.000 219.584
2022-09-20 15:09:31.000 191.888
2022-09-20 15:09:41.100 172.979
2022-09-20 15:09:51.100 144.505
2022-09-20 15:10:01.100 125.596
2022-09-20 15:10:11.100 102.883
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.8160
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
main.py
import pandas as pd
import numpy as np
from pprint import pprint

# Whitespace-separated file. Each timestamp contains a space, so the date lands in
# 'TimeStamp', the time in 'Data' and the numeric values in 'Label'.
df = pd.read_csv('2.txt', sep=r'\s+')
# Mark all rows with a value of at least 100 as 'Y'.
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# Select the values of the marked rows as a pandas.Series.
s = df.loc[df['mark'] == 'Y', 'Label']
# Split into groups of consecutive rows: a new group starts wherever the index jumps by more than 1.
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# In the original dataframe, relabel every group whose maximum value is > 160 as 'N'.
for name, group in grouped:
    if group.max() > 160:
        df.loc[group.index, 'mark'] = 'N'
pprint(df)
------------------------------
TimeStamp Data Label mark
0 2022-09-20 14:57:28.900 13.65600
1 2022-09-20 14:57:38.900 21.13060
2 2022-09-20 14:58:39.200 75.87700
3 2022-09-20 14:58:49.200 85.39810
4 2022-09-20 14:58:59.200 98.76780
5 2022-09-20 14:59:09.300 107.11000 Y
6 2022-09-20 14:59:19.300 125.61800 Y
7 2022-09-20 14:59:29.400 126.10800 Y
8 2022-09-20 14:59:39.400 124.50600 Y
9 2022-09-20 14:59:49.400 124.17200 Y
10 2022-09-20 14:59:59.500 124.52800 Y
11 2022-09-20 15:00:09.500 121.19100 Y
12 2022-09-20 15:00:19.500 113.04900 Y
13 2022-09-20 15:00:29.500 91.29320
14 2022-09-20 15:00:39.600 76.87810
15 2022-09-20 15:00:49.600 55.47780
16 2022-09-20 15:00:59.600 41.08490
17 2022-09-20 15:02:09.800 8.02791
18 2022-09-20 15:03:00.000 27.27030
19 2022-09-20 15:03:10.000 36.65800
20 2022-09-20 15:04:10.100 83.08460
21 2022-09-20 15:04:20.100 101.49130 N
22 2022-09-20 15:05:40.400 152.86900 N
23 2022-09-20 15:05:50.400 161.96700 N
24 2022-09-20 15:06:00.400 166.86200 N
25 2022-09-20 15:08:40.900 294.93000 N
26 2022-09-20 15:08:50.900 280.09200 N
27 2022-09-20 15:09:00.900 261.40500 N
28 2022-09-20 15:09:11.000 237.29100 N
29 2022-09-20 15:09:21.000 219.58400 N
30 2022-09-20 15:09:31.000 191.88800 N
31 2022-09-20 15:09:41.100 172.97900 N
32 2022-09-20 15:09:51.100 144.50500 N
33 2022-09-20 15:10:01.100 125.59600 N
34 2022-09-20 15:10:11.100 102.88300 N
35 2022-09-20 15:11:11.300 19.68460
36 2022-09-20 15:11:21.400 17.81600
37 2022-09-20 15:11:31.400 27.89320
38 2022-09-20 15:11:41.400 23.15490
39 2022-09-20 15:11:51.400 14.45690
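The same grouping idea can be written without an explicit loop. The sketch below is one possible variant, not the answer's own code: it assumes the numeric values are in a column named Data (rather than Label, as produced by the whitespace parsing above) and uses a shortened subset of the question's values.

```python
import numpy as np
import pandas as pd

# Shortened subset of the question's values; the column name "Data" is an assumption.
df = pd.DataFrame({"Data": [
    13.656, 98.7678, 107.11, 125.618, 113.049, 91.2932, 27.2703,
    101.4913, 152.869, 166.862, 294.93, 102.883, 19.6846,
]})

above = df["Data"] >= 100

# A new run id starts every time the above/below flag flips.
run_id = above.ne(above.shift()).cumsum()

# Maximum of each run, broadcast back to every row of that run.
run_max = df["Data"].groupby(run_id).transform("max")

# Runs of 100 and above that stay at or below 160 are 'Y', those exceeding 160 are 'N'.
df["mark"] = np.select(
    [above & (run_max <= 160), above & (run_max > 160)],
    ["Y", "N"],
    default="",
)
```

Because the label depends on the maximum of each run rather than on its position, this also handles data in which several Y and N runs alternate.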