Vectorized way to find first occurrence per row
Question:
I have two pandas DataFrames, df_x and df_y. df_x has two columns, ‘high target’ and ‘low target’. For each row of df_x, I would like to search through the rows of df_y and see whether the ‘high target’ was reached before the ‘low target’. Currently, I implemented this using .apply. However, my code is too inefficient, as it scales linearly with the number of rows in df_x. Any suggestions to optimize/vectorize my code?
import pandas as pd

def efficient_high_after_low(row, minute_df):
    """True if the high happened after the low, else False.

    Args:
        row: a row of df_x (a pandas Series) with the high and low targets
        minute_df: the whole timeseries to scan (a pandas Series)
    """
    minute_df_after = minute_df.loc[row.period_end_idx + pd.Timedelta(minutes=1):]
    first_highs = minute_df_after.ge(row['high target'])
    first_lows = minute_df_after.le(row['low target'])
    hi_sum, lo_sum = first_highs.sum(), first_lows.sum()
    if len(first_highs) != len(first_lows):
        raise Exception('Unequal length of first_highs and first_lows')
    if len(first_highs) == 0:
        return None
    elif (hi_sum == 0) and (lo_sum != 0):
        return True
    elif (hi_sum != 0) and (lo_sum == 0):  # was low_sum, an undefined name
        return False
    elif (hi_sum == 0) and (lo_sum == 0):
        return None
    elif first_highs.idxmax() > first_lows.idxmax():
        return True
    elif first_highs.idxmax() < first_lows.idxmax():
        return False
    else:
        return None
And I do the following to get these boolean values:
df_x.apply(efficient_high_after_low, axis=1, args=(df_y['open'],))
Running the code above on the first 1000 lines takes 4 seconds.
Answers:
This is what you could do:
First of all, put the open column in your main dataframe; let's call it df. (Note: this only works if df_x and df_y have the exact same index; if they don't, consider other solutions like pd.concat or pd.merge_asof.)
df = df_x
df["open"] = df_y["open"]
I also took the liberty of renaming your columns.
As long as your timeseries index is ordered, we can reset the index with
df = df.reset_index()
So now df looks something like this (values are made up):
high_trgt low_trgt open
0 8.746911 8.712824 9.243329
1 9.472977 10.190079 9.744083
2 9.445111 10.269676 9.859353
3 9.972061 10.014381 9.132204
4 8.934692 8.914729 11.453276
# You "time" column isn't actually necessary for this solution
We can create boolean maps of where the targets have been hit:
map_high = df.open.values >= df.high_trgt.values
map_low = df.open.values <= df.low_trgt.values
Now the resource-intensive bit:
df["high_was_hit_on"] = pd.Series([map_high[i+1:].argmax() for i in range(len(map_high)-1)])
df["low_was_hit_on"] = pd.Series([map_low[i+1:].argmax() for i in range(len(map_low)-1)])
Output:
   high_trgt   low_trgt       open  high_was_hit_on  low_was_hit_on
0   8.746911   8.712824   9.243329              0.0             0.0
1   9.472977  10.190079   9.744083              0.0             0.0
2   9.445111  10.269676   9.859353              1.0             0.0
3   9.972061  10.014381   9.132204              0.0             0.0
4   8.934692   8.914729  11.453276              NaN             NaN
(The last row is NaN because the comprehensions produce one element fewer than the dataframe has rows: the last row has no "next row" to scan.)
What I did here is iterate over the rows: for each row i, argmax on the boolean map from row i+1 onward gives the offset of the first True, i.e. the first hit. Be aware that argmax also returns 0 when there is no True at all, so a 0 is ambiguous between "hit on the very next row" and "never hit".
Now we can easily check which happened first by doing:
# Here you can customize what you need the results to be
# when two hits happen at the same time
df["high_after_low"] = df.high_was_hit_on < df.low_was_hit_on
In terms of speed, this is the test over a df with 1M rows (the steps above wrapped in a function find_first_hit):
%timeit find_first_hit(df)
3.46 s ± 253 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Truth be told, this isn’t exactly vectorized, but I can’t think of anything that you could do to really achieve that here. Hope that my solution is helpful anyway.
Solution 1 (Iterative):
import numpy as np

def high_after_low(high_targets, low_targets, vals, dic):
    """True if the high target was hit after the low; else, False.

    Args:
        high_targets: A NumPy array, the high target of each row.
        low_targets: A NumPy array, the low target of each row.
        vals: A NumPy array, the current values.
        dic: A dictionary mapping each row i of the targets to the index
            in vals from which that row's scan starts.

    Returns:
        high_after_low: A NumPy float array with 1.0 (True), 0.0 (False),
        and NaN per row. Meaning of each value:
            True: the timeseries hit high after low (or high was never hit)
            False: the timeseries hit high before low (or low was never hit)
            NaN: (1) neither low nor high was hit, or (2) low and high were
                hit at the same row
    """
    size = len(dic)
    # The walrus assignments work because the conditional's condition is
    # evaluated first, so a/b are bound before the if-branch uses them.
    high_hit_rows = [
        (a.argmax() + dic[i]) if (a := (vals[dic[i]:] >= high_targets[i])).any() else np.nan
        for i in range(size)
    ]
    low_hit_rows = [
        (b.argmax() + dic[i]) if (b := (vals[dic[i]:] <= low_targets[i])).any() else np.nan
        for i in range(size)
    ]
    high_hit_rows = np.array(high_hit_rows, dtype=np.float32)
    low_hit_rows = np.array(low_hit_rows, dtype=np.float32)
    high_after_low = np.full(size, np.nan)
    high_after_low[np.isnan(low_hit_rows) & ~np.isnan(high_hit_rows)] = False
    high_after_low[~np.isnan(low_hit_rows) & np.isnan(high_hit_rows)] = True
    both_hit = ~np.isnan(low_hit_rows) & ~np.isnan(high_hit_rows)
    high_after_low[both_hit & (low_hit_rows < high_hit_rows)] = True
    high_after_low[both_hit & (low_hit_rows > high_hit_rows)] = False
    return high_after_low
Solution 1 (Vectorized):
The vectorized solution requires pre-processing: the input array must be turned into a 2-D array whose i-th row contains the array's values from [i:i+T], i.e. each row carries a lookahead window of T values. Given such a window DataFrame:
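One way to build that 2-D window array without copying (assuming NumPy >= 1.20) is sliding_window_view; the values below are illustrative:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

vals = np.array([9.2, 9.7, 9.9, 9.1, 11.5, 8.8])
T = 3  # lookahead horizon

# Row i is a zero-copy view of vals[i:i+T]; there are len(vals) - T + 1 rows.
windows = sliding_window_view(vals, T)
```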
import numpy as np
import pandas as pd

def vectorized_high_after_low(df, high_values, low_values):
    """Args:
        df: A pandas DataFrame with one row per original row; column t holds
            the value of row i, t rows ahead (columns labelled 0..T-1).
        high_values: The high targets corresponding to each row.
        low_values: The low targets corresponding to each row.
    """
    # axis=0 aligns the per-row targets with the row index, not the columns.
    higher = df.ge(high_values, axis=0).idxmax(axis=1)
    lower = df.le(low_values, axis=0).idxmax(axis=1)
    # idxmax returns the first column label (0) when a row has no hit at all;
    # treat that as "hit beyond the horizon". Note this conflates it with a
    # genuine hit at column 0, so exclude offset 0 if it is meaningful.
    higher[higher == 0] = df.shape[1]
    lower[lower == 0] = df.shape[1]
    high_after_low = higher < lower
    high_after_low[higher == lower] = np.nan  # tie, or neither was hit
    return high_after_low
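A tiny self-contained check of the comparison step (made-up values; note the axis=0 so that the per-row targets align with the row index rather than with the column labels):

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

vals = np.array([9.2, 9.7, 9.9, 9.1, 11.5])
win = pd.DataFrame(sliding_window_view(vals, 3))  # columns 0, 1, 2

high = pd.Series([9.8, 10.0, 9.0])  # made-up per-row targets
low = pd.Series([9.0, 9.0, 9.2])

# First lookahead offset (column label) at which each target is reached;
# idxmax falls back to column 0 for rows with no hit at all.
higher = win.ge(high, axis=0).idxmax(axis=1)
lower = win.le(low, axis=0).idxmax(axis=1)
```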