Check for conditional first occurrence in dataframe values
Question:
I have a sample dataframe (df) like below:
Date_Time Open High Low Close UOD VWB
20 2020-07-01 10:30:00 10298.85 10299.90 10287.85 10299.90 UP 3
21 2020-07-01 10:35:00 10301.40 10310.00 10299.15 10305.75 UP 3
22 2020-07-01 10:40:00 10305.75 10305.75 10285.50 10290.00 DOWN 3
24 2020-07-01 10:45:00 10290.00 10291.20 10277.65 10282.65 DOWN 0
25 2020-07-01 10:50:00 10282.30 10289.80 10278.00 10282.00 DOWN 3
26 2020-07-01 10:55:00 10280.10 10295.00 10279.80 10291.50 UP 3
27 2020-07-01 11:00:00 10290.00 10299.95 10287.30 10297.55 UP 3
28 2020-07-01 11:05:00 10296.70 10306.30 10294.50 10299.40 UP 3
29 2020-07-01 11:10:00 10299.95 10301.10 10291.50 10292.00 DOWN 0
30 2020-07-01 11:15:00 10293.05 10298.70 10286.00 10291.55 DOWN 3
31 2020-07-01 11:20:00 10292.00 10298.70 10286.00 10351.45 DOWN 1
I have the following conditions:
- Check for df['VWB'] == 0 & df['UOD'] == "DOWN" and get the corresponding Open value (= 10290.00 in my example).
- Then find the first occurrence of a Close value greater than this Open value (10290.00) after that row.
I want my desired output as below, with a Valid column:
Date_Time Open High Low Close UOD VWB Valid
20 2020-07-01 10:30:00 10298.85 10299.90 10287.85 10299.90 UP 3 0
21 2020-07-01 10:35:00 10301.40 10310.00 10299.15 10305.75 UP 3 0
22 2020-07-01 10:40:00 10305.75 10305.75 10285.50 10290.00 DOWN 3 0
23 2020-07-01 10:45:00 10290.00 10291.20 10277.65 10282.65 DOWN 0 0
25 2020-07-01 10:50:00 10282.30 10289.80 10278.00 10282.00 DOWN 3 0
26 2020-07-01 10:55:00 10280.10 10295.00 10279.80 10291.50 UP 3 1 <<= first occurrence
27 2020-07-01 11:00:00 10290.00 10299.95 10287.30 10297.55 UP 3 0
28 2020-07-01 11:05:00 10296.70 10306.30 10294.50 10299.40 UP 3 0
29 2020-07-01 11:10:00 10299.95 10301.10 10291.50 10292.00 DOWN 0 0
30 2020-07-01 11:15:00 10293.05 10298.70 10286.00 10291.55 DOWN 3 0
31 2020-07-01 11:20:00 10292.00 10298.70 10286.00 10351.45 DOWN 1 1 <<= first occurrence
Answers:
This is a little tricky, as I assume it’s possible to have multiple rows that satisfy the following boolean.
df.loc[(df["VWB"] == 0) & (df["UOD"] == "DOWN")]
We can create a pseudo key to capture each group with a vectorised operation.
I’ve edited your sample so we have 2 rows that evaluate to True for the above boolean.
print(df)
Date_Time Open High Low Close UOD VWB
0 2020-07-01 10:30:00 10298.85 10299.90 10287.85 10299.90 UP 3
1 2020-07-01 10:35:00 10301.40 10310.00 10299.15 10305.75 UP 3
2 2020-07-01 10:40:00 10305.75 10305.75 10285.50 10290.00 DOWN 3
3 2020-07-01 10:45:00 10290.00 10291.20 10277.65 10282.65 DOWN 0
4 2020-07-01 10:50:00 10282.30 10289.80 10278.00 10282.00 DOWN 3
5 2020-07-01 10:55:00 10280.10 10295.00 10279.80 10291.50 UP 3
6 2020-07-01 11:00:00 10290.00 10299.95 10287.30 10297.55 UP 3
7 2020-07-01 11:05:00 10296.70 10306.30 10294.50 10299.40 UP 3
8 2020-07-01 11:10:00 10299.95 10301.10 10291.50 10292.00 DOWN 0
9 2020-07-01 11:15:00 10293.05 10298.70 10286.00 10595.55 DOWN 3
s = df.loc[(df["VWB"] == 0) & (df["UOD"] == "DOWN"), "Open"]
df1 = df.assign(key=df.index.isin(s.index).cumsum())
# rows before the first match get key 0; we will filter that key out later.
print(df1)
Date_Time Open High Low Close UOD VWB key
0 2020-07-01 10:30:00 10298.85 10299.90 10287.85 10299.90 UP 3 0
1 2020-07-01 10:35:00 10301.40 10310.00 10299.15 10305.75 UP 3 0
2 2020-07-01 10:40:00 10305.75 10305.75 10285.50 10290.00 DOWN 3 0
3 2020-07-01 10:45:00 10290.00 10291.20 10277.65 10282.65 DOWN 0 1
4 2020-07-01 10:50:00 10282.30 10289.80 10278.00 10282.00 DOWN 3 1
5 2020-07-01 10:55:00 10280.10 10295.00 10279.80 10291.50 UP 3 1
6 2020-07-01 11:00:00 10290.00 10299.95 10287.30 10297.55 UP 3 1
7 2020-07-01 11:05:00 10296.70 10306.30 10294.50 10299.40 UP 3 1
8 2020-07-01 11:10:00 10299.95 10301.10 10291.50 10292.00 DOWN 0 2
9 2020-07-01 11:15:00 10293.05 10298.70 10286.00 10595.55 DOWN 3 2
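To see the cumulative-sum key in isolation, here is a minimal sketch on a toy boolean mask. The values in `trigger` are made up for illustration and are not taken from the sample above.

```python
import pandas as pd

# Toy data (an assumption for illustration): True marks a "trigger" row.
trigger = pd.Series([False, False, True, False, False, True, False])

# Each True bumps the running total, so every row inherits the number of
# triggers seen so far; rows before the first trigger keep key 0.
key = trigger.cumsum()

print(key.tolist())  # [0, 0, 1, 1, 1, 2, 2]
```

Every row from one trigger up to (but not including) the next shares a key, which is what lets groupby treat each stretch as its own group.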
Now, for each group, we need to compare against the first instance of Open and see where Close is greater.
import numpy as np
idx = df1.assign(tempOpen=df1.groupby("key")["Open"].transform("first")).query(
    "Close > tempOpen"
).groupby("key", as_index=False)["key"].idxmin()
df['valid'] = np.where(df1.index.isin(idx) & df1.key.ne(0),1,0)
print(df[['Open','Close','valid']])
Open Close valid
0 10298.85 10299.90 0
1 10301.40 10305.75 0
2 10305.75 10290.00 0
3 10290.00 10282.65 0
4 10282.30 10282.00 0
5 10280.10 10291.50 1
6 10290.00 10297.55 0
7 10296.70 10299.40 0
8 10299.95 10292.00 0
9 10293.05 10595.55 1
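For comparison, here is a self-contained variation on the same grouping idea, run against the question's original sample (only the columns the logic needs). It is a sketch, not the answer's exact code: it uses `where(...).ffill()` in place of `transform("first")`, and the names `trigger`, `ref_open`, `hit` and `first_hit` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Open": [10298.85, 10301.40, 10305.75, 10290.00, 10282.30, 10280.10,
                 10290.00, 10296.70, 10299.95, 10293.05, 10292.00],
        "Close": [10299.90, 10305.75, 10290.00, 10282.65, 10282.00, 10291.50,
                  10297.55, 10299.40, 10292.00, 10291.55, 10351.45],
        "UOD": ["UP", "UP", "DOWN", "DOWN", "DOWN", "UP",
                "UP", "UP", "DOWN", "DOWN", "DOWN"],
        "VWB": [3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 1],
    },
    index=[20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31],
)

# Rows that start a new group.
trigger = df["VWB"].eq(0) & df["UOD"].eq("DOWN")
key = trigger.cumsum()

# Open of the most recent trigger row, carried forward to the rows after it.
ref_open = df["Open"].where(trigger).ffill()

# Candidates: rows after a trigger (not the trigger itself) whose Close
# exceeds that trigger's Open. Comparisons against NaN are False.
hit = ~trigger & key.gt(0) & df["Close"].gt(ref_open)

# Keep only the first candidate inside each group.
first_hit = hit & hit.astype(int).groupby(key).cumsum().eq(1)

df["Valid"] = first_hit.astype(int)
print(df.loc[df["Valid"].eq(1)].index.tolist())  # [26, 31]
```

Excluding the trigger row itself is a deliberate choice here, to match "after that row" in the question.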
Try:
df['Val'] = 0
# 1st condition: Open of the first row where VWB == 0 and UOD == "DOWN"
open_val = df.loc[df['VWB'].eq(0) & df['UOD'].eq("DOWN"), 'Open'].values[0]
u = df.loc[df['Close'] > open_val]
# 2nd condition: index label of the row whose Close is nearest above open_val
pos = (u['Close'] - open_val).idxmin()
df.loc[pos, 'Val'] = 1
Date_Time Open High Low Close UOD VWB Val
20 2020-07-01 10:30:00 10298.85 10299.90 10287.85 10299.90 UP 3 0
21 2020-07-01 10:35:00 10301.40 10310.00 10299.15 10305.75 UP 3 0
22 2020-07-01 10:40:00 10305.75 10305.75 10285.50 10290.00 DOWN 3 0
24 2020-07-01 10:45:00 10290.00 10291.20 10277.65 10282.65 DOWN 0 0
25 2020-07-01 10:50:00 10282.30 10289.80 10278.00 10282.00 DOWN 3 0
26 2020-07-01 10:55:00 10280.10 10295.00 10279.80 10291.50 UP 3 1
27 2020-07-01 11:00:00 10290.00 10299.95 10287.30 10297.55 UP 3 0
28 2020-07-01 11:05:00 10296.70 10306.30 10294.50 10299.40 UP 3 0
29 2020-07-01 11:10:00 10299.95 10301.10 10291.50 10292.00 DOWN 3 0
30 2020-07-01 11:15:00 10293.05 10298.70 10286.00 10291.55 DOWN 3 0
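If the match has to come strictly after the trigger row and be the earliest one in time, the lookup can be restricted to the later rows. The following is a rough sketch for a single trigger, using a small slice of the sample data; the names `start`, `later` and `above` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Open": [10305.75, 10290.00, 10282.30, 10280.10, 10290.00],
        "Close": [10290.00, 10282.65, 10282.00, 10291.50, 10297.55],
        "UOD": ["DOWN", "DOWN", "DOWN", "UP", "UP"],
        "VWB": [3, 0, 3, 3, 3],
    },
    index=[22, 24, 25, 26, 27],
)
df["Val"] = 0

trigger = df["VWB"].eq(0) & df["UOD"].eq("DOWN")
if trigger.any():
    start = trigger.idxmax()             # label of the first trigger row
    open_val = df.at[start, "Open"]
    later = df.loc[start:].iloc[1:]      # rows strictly after the trigger
    above = later[later["Close"] > open_val]
    if not above.empty:
        df.loc[above.index[0], "Val"] = 1  # earliest qualifying row

print(df["Val"].tolist())  # [0, 0, 0, 1, 0]
```

This assumes the index is sorted so that label slicing with `df.loc[start:]` returns the rows from the trigger onwards.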
You can follow this approach using apply:
def valid_column(df):
    # sentinel larger than any Open value, meaning "no pending Open"
    max_val = max(df['Open']) + 1
    min_open = max_val
    def find_valid(row):
        # nonlocal (not global): min_open lives in the enclosing valid_column
        nonlocal min_open
        if min_open < max_val and row['Close'] > min_open:
            min_open = max_val
            return 1
        if row['VWB'] == 0 and row['UOD'] == "DOWN":
            min_open = min(min_open, row['Open'])
        return 0
    return df.apply(find_valid, axis=1)
df['Valid'] = valid_column(df)
You only go through the dataset once. Note that apply with axis=1 still calls the function row by row in Python, so it is usually slower than a vectorised solution on large frames.
The min_open variable keeps track of the lowest pending "Open" value. If any later row has a bigger "Close" value, a 1 is returned and min_open is reset.
Note that the state is shared through the nonlocal keyword, so min_open stays private to valid_column and cannot clash with another variable of the same name in your code. A global declaration would not work here, because min_open is defined inside valid_column rather than at module level.
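The same single-pass logic can also be sketched as an explicit loop over the columns, which avoids shared state between function calls altogether. `first_close_above` is a hypothetical helper name, and the small frame below is a cut-down version of the sample data.

```python
import pandas as pd

def first_close_above(df):
    """Return a list of 0/1 flags following the single-pass logic above."""
    ref_open = None  # lowest pending trigger Open, or None if nothing is pending
    flags = []
    for open_, close, uod, vwb in zip(df["Open"], df["Close"], df["UOD"], df["VWB"]):
        if ref_open is not None and close > ref_open:
            flags.append(1)
            ref_open = None  # reset after the first hit
            continue
        if vwb == 0 and uod == "DOWN":
            ref_open = open_ if ref_open is None else min(ref_open, open_)
        flags.append(0)
    return flags

df = pd.DataFrame(
    {
        "Open": [10290.00, 10282.30, 10280.10, 10299.95, 10292.00],
        "Close": [10282.65, 10282.00, 10291.50, 10292.00, 10351.45],
        "UOD": ["DOWN", "DOWN", "UP", "DOWN", "DOWN"],
        "VWB": [0, 3, 3, 0, 1],
    }
)
df["Valid"] = first_close_above(df)
print(df["Valid"].tolist())  # [0, 0, 1, 0, 1]
```

Using None instead of a sentinel value removes the need to compute max(df['Open']) + 1 up front.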
One small nit to Umar’s good answer: idx is returned as a DataFrame (because of as_index=False), so isin needs its key column, and it is safer to wrap each condition in parentheses when combining them with &:
df['valid'] = np.where(df1.index.isin(idx) & df1.key.ne(0),1,0)
# should be:
df['valid'] = np.where((df1.index.isin(idx['key'])) & (df1.key.ne(0)), 1, 0)
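As a side note on when the parentheses actually matter, here is a small sketch on made-up data. In Python, `&` binds more tightly than `==`, so comparisons written with operators need wrapping, while method calls such as `.eq()` are already self-contained expressions.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"VWB": [0, 3, 0], "UOD": ["DOWN", "DOWN", "UP"]})

# Method calls are evaluated before `&`, so no extra parentheses are needed.
mask_methods = df["VWB"].eq(0) & df["UOD"].eq("DOWN")

# With comparison operators, each comparison must be wrapped, because `&`
# has higher precedence than `==`.
mask_ops = (df["VWB"] == 0) & (df["UOD"] == "DOWN")

print(mask_methods.tolist())  # [True, False, False]
```

Both masks are identical; the parenthesised form is simply the safer habit when mixing operators.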