Check if values are between two values in pandas
Question:
I have two values that are found in a for loop, like so:

```python
for i in range(df_zones.shape[0]):
    filter_max = df_labels[df_labels['Labels'] == i].sort_values(by='level')['level'].iloc[-1]
    filter_min = df_labels[df_labels['Labels'] == i].sort_values(by='level')['level'].iloc[0]
```
I have another dataframe with 4 columns of measurements with a timeseries index, like so:
DateTime | meas1 | meas2 | meas3 | meas4 |
---|---|---|---|---|
2022-1-1 | 1.1 | 1.2 | 1.3 | 1.1 |
There are thousands of rows of data.
What I am trying to do is add another column, labeled 'isZone', which indicates whether any of the values in the row fall between filter_min and filter_max.
DateTime | meas1 | meas2 | meas3 | meas4 | isZone |
---|---|---|---|---|---|
2022-1-1 | 1.1 | 1.5 | 1.5 | 1.7 | 0 |
2022-1-2 | 2.2 | 1.4 | 1.5 | 1.7 | 0 |
2022-1-3 | 3.1 | 1.2 | 1.3 | 1.1 | 1 |
2022-1-4 | 4.1 | 1.2 | 1.3 | 1.1 | 1 |
2022-1-5 | 5.1 | 1.2 | 1.3 | 1.1 | 1 |
I have read about the pandas between function, but I can't figure out how to make it work here. Is there a quicker way to do this in NumPy? Any guidance would be appreciated.
Answers:
You can solve this with apply and pandas' between:

```python
df_zones['Flag'] = df_zones.apply(
    lambda x: 1 if x.between(filter_min, filter_max).any() else 0, axis=1
)
```
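Since apply with axis=1 runs a Python function per row, a vectorized comparison is usually much faster on large frames. A minimal sketch with made-up data, assuming filter_min and filter_max are scalars:

```python
import pandas as pd

# Toy measurement frame standing in for df_zones (column names illustrative)
df_zones = pd.DataFrame({
    'meas1': [1.1, 2.2, 3.1],
    'meas2': [1.5, 1.4, 1.2],
    'meas3': [1.5, 1.5, 1.3],
    'meas4': [1.7, 1.7, 1.1],
})
filter_min, filter_max = 1.0, 1.35

# Compare every cell at once, then collapse each row to a single flag
mask = (df_zones >= filter_min) & (df_zones <= filter_max)
df_zones['Flag'] = mask.any(axis=1).astype(int)
```

This produces the same 0/1 column as the apply version, but the comparisons run in C over the whole frame at once.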
How about transposing with .T and using a list comprehension this way?

```python
df_zones['Flag'] = [1 if df_zones.T[x].between(min_, max_).any() else 0 for x in df_zones.T]
```
Or without the transposing:

```python
df_zones['Flag'] = [1 if df_zones.loc[x, :].between(min_, max_).any() else 0 for x in df_zones.index]
```
The methods above took 30 minutes to compute; the method below finishes in under 2 seconds.
In the end, the best approach was to append each set of per-zone results to its own list, then use a function that combines all the pandas between checks and tests row-wise whether any of them produced a one.
```python
def arrayBoolCheck(arrays):
    # Stack the per-zone arrays as columns; flag rows containing any 1
    df = pd.DataFrame(arrays).T
    return (df == 1).any(axis=1).astype(int)

isZone1, isZone2, isZone3, isZone4 = [], [], [], []

for i in range(df_zones.shape[0]):
    filter_max = df_labels[df_labels['Labels'] == i].sort_values(by='level')['level'].iloc[-1]
    filter_min = df_labels[df_labels['Labels'] == i].sort_values(by='level')['level'].iloc[0]
    isZone1.append(df_instrument["meas1"].between(filter_min, filter_max, inclusive='both').astype(int).values)
    isZone2.append(df_instrument["meas2"].between(filter_min, filter_max, inclusive='both').astype(int).values)
    isZone3.append(df_instrument["meas3"].between(filter_min, filter_max, inclusive='both').astype(int).values)
    isZone4.append(df_instrument["meas4"].between(filter_min, filter_max, inclusive='both').astype(int).values)

# Zone labels dataframe
df = pd.DataFrame(
    data=[arrayBoolCheck(np.array(isZone1)),
          arrayBoolCheck(np.array(isZone2)),
          arrayBoolCheck(np.array(isZone3)),
          arrayBoolCheck(np.array(isZone4))],
    index=["isZone1", "isZone2", "isZone3", "isZone4"]).T
```
I found that keeping the measurements in their respective columns was better for the analysis, but the same function could be used to combine them all into one column if needed.
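For completeness, the same any-zone reduction can be expressed in one shot with NumPy broadcasting, without a Python-level loop over zones. A sketch under the assumption that the per-zone bounds have been collected into two 1-D arrays, `mins` and `maxs` (names hypothetical), one entry per zone:

```python
import numpy as np
import pandas as pd

# Toy data: 3 zones, 4 measurement rows (values illustrative)
df_instrument = pd.DataFrame({'meas1': [1.1, 2.2, 3.1, 4.1]})
mins = np.array([1.0, 2.0, 3.0])   # filter_min per zone
maxs = np.array([1.5, 2.5, 3.5])   # filter_max per zone

vals = df_instrument['meas1'].to_numpy()            # shape (n_rows,)
# Broadcast rows against zones: result has shape (n_rows, n_zones)
in_zone = (vals[:, None] >= mins) & (vals[:, None] <= maxs)
# 1 if the value falls inside *any* zone, matching the arrayBoolCheck reduction
is_zone = in_zone.any(axis=1).astype(int)
```

The `(n_rows, n_zones)` boolean matrix replaces the list-of-arrays bookkeeping, and the `any(axis=1)` step plays the role of arrayBoolCheck.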