Vectorized way to find first occurrence per row

Question:

I have two Pandas DataFrames, df_x and df_y. df_x has two columns, ‘high target’ and ‘low target’. For every row of df_x, I would like to scan through df_y and check whether the ‘high target’ was reached before the ‘low target’. I currently implement this with .apply, but it is too slow because the cost scales linearly with the number of rows in df_x. Any suggestions to optimize/vectorize my code?

df_x (sample rows omitted)

df_y (sample rows omitted)

import pandas as pd


def efficient_high_after_low(row, minute_df):
    """True if the high target was hit after the low target, else False.

    Args:
        row: a row of df_x (a pandas Series) with 'high target',
             'low target', and 'period_end_idx'.
        minute_df: the series of prices to scan (here df_y['open']).
    """
    minute_df_after = minute_df.loc[row.period_end_idx + pd.Timedelta(minutes=1):]
    first_highs = minute_df_after.ge(row['high target'])
    first_lows = minute_df_after.le(row['low target'])

    hi_sum, lo_sum = first_highs.sum(), first_lows.sum()
    if len(first_highs) != len(first_lows):
        raise Exception('Unequal length of first_highs and first_lows')

    if len(first_highs) == 0:
        return None
    elif (hi_sum == 0) & (lo_sum != 0):
        return True
    elif (hi_sum != 0) & (lo_sum == 0):
        return False
    elif (hi_sum == 0) & (lo_sum == 0):
        return None
    elif first_highs.idxmax() > first_lows.idxmax():
        return True
    elif first_highs.idxmax() < first_lows.idxmax():
        return False
    else:
        return None

And I do the following to get these boolean values:

df_x.apply(efficient_high_after_low, axis=1, args=(df_y['open'],))

Running the code above on the first 1000 rows takes 4 seconds.

Asked By: Vanillihoot


Answers:

This is what you could do:

First of all, put the open column into your main dataframe, which we’ll call df. Note that this only works if df_y has exactly the same index; if it doesn’t, consider other solutions such as pd.concat or pd.merge_asof (a quick sketch of the merge_asof route follows the snippet below).

df = df_x
df["open"] = df_y["open"]

I also took the liberty of renaming your columns.

As long as your timeseries index is ordered, we can reset the index with

df = df.reset_index()

So now we have a df that looks something like this (values are made up):

   high_trgt    low_trgt    open
0   8.746911    8.712824    9.243329
1   9.472977    10.190079   9.744083
2   9.445111    10.269676   9.859353
3   9.972061    10.014381   9.132204
4   8.934692    8.914729    11.453276

# Your "time" column isn't actually necessary for this solution

We can create a map of where the targets have been hit

map_high = df.open.values >= df.high_trgt.values
map_low = df.open.values <= df.low_trgt.values

Now the resource intensive bit:

df["high_was_hit_on"] = pd.Series([map_high[i+1:].argmax() for i in range(len(map_high)-1)])
df["low_was_hit_on"] = pd.Series([map_low[i+1:].argmax() for i in range(len(map_low)-1)])

Output:

    high_trgt   low_trgt    open        high_was_hit_on low_was_hit_on
0   8.746911    8.712824    9.243329    0               0
1   9.472977    10.190079   9.744083    0               3
2   9.445111    10.269676   9.859353    0               2
3   9.972061    10.014381   9.132204    1               1
4   8.934692    8.914729    11.453276   0               0

What I did here is iterate over the rows and, for each one, take the argmax of the remaining portion of the boolean map we created before; for a boolean array, argmax gives the offset of the first True, i.e. the first hit after that row.

Now we can easily check which happened first by doing:

# Here you can customize what you need the results to be
# when two hits happen at the same time
df["high_after_low"] = df.high_was_hit_on < df.low_was_hit_on

In terms of speed, this is the test over a df with 1M rows:

%timeit find_first_hit(df)
3.46 s ± 253 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
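
(find_first_hit isn’t defined in the answer; presumably it just bundles the steps above into one function, roughly along these lines — my reconstruction, not the answerer’s code:)

import pandas as pd

def find_first_hit(df):
    # Assumed wrapper that repeats the steps shown above in one place.
    map_high = df.open.values >= df.high_trgt.values
    map_low = df.open.values <= df.low_trgt.values
    df["high_was_hit_on"] = pd.Series([map_high[i + 1:].argmax()
                                       for i in range(len(map_high) - 1)])
    df["low_was_hit_on"] = pd.Series([map_low[i + 1:].argmax()
                                      for i in range(len(map_low) - 1)])
    df["high_after_low"] = df.high_was_hit_on < df.low_was_hit_on
    return df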

Truth be told, this isn’t exactly vectorized, but I can’t think of anything that you could do to really achieve that here. Hope that my solution is helpful anyway.

Answered By: 965311532

Solution 1 (Iterative):

import numpy as np


def high_after_low(high_targets,
                   low_targets,
                   vals,
                   dic):
    """True, if the high target was hit after the low; else, False.

    Args:
        high_targets: A NumPy array, the high target of each row.
        low_targets: A NumPy array, the low target of each row.
        vals: A NumPy array, the current (open) values.
        dic: A dictionary mapping every row of the targets to the
             position in vals from which the scan should start.

    Returns:
        high_after_low: A NumPy array with True (1.0), False (0.0), and
        NaN values per row. Meaning of each value:
            True: the timeseries hit the high after the low (or the high
                  was never hit)
            False: the timeseries hit the high before the low (or the low
                   was never hit)
            NaN: (1) neither the low nor the high was hit, or (2) the low
                 and the high were hit at the same row
    """
    dic_keys = list(dic.keys())
    size = len(dic_keys)

    # Position of the first hit per row, or NaN if there is no hit at all.
    high_hit_rows = [(a.argmax() + dic[i] if (a := (vals[dic[i]:]
                     >= high_targets[i])).any() else np.nan) for i in range(size)]
    low_hit_rows = [(b.argmax() + dic[i] if (b := (vals[dic[i]:]
                    <= low_targets[i])).any() else np.nan) for i in range(size)]

    high_hit_rows = np.array(high_hit_rows, dtype=np.float32)
    low_hit_rows = np.array(low_hit_rows, dtype=np.float32)

    high_after_low = np.empty(size)
    high_after_low[:] = np.nan

    high_after_low[np.isnan(low_hit_rows) & ~np.isnan(high_hit_rows)] = False
    high_after_low[~np.isnan(low_hit_rows) & np.isnan(high_hit_rows)] = True
    high_after_low[~np.isnan(low_hit_rows) & ~np.isnan(high_hit_rows)
                   & (low_hit_rows < high_hit_rows)] = True
    high_after_low[~np.isnan(low_hit_rows) & ~np.isnan(high_hit_rows)
                   & (low_hit_rows > high_hit_rows)] = False

    return high_after_low
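
The construction of vals and dic isn’t shown above; a plausible call (my assumption, mirroring the original one-minute offset by starting the scan one row ahead) might look like:

import numpy as np

# Hypothetical setup, not part of the original answer: assumes df_x and
# df_y['open'] are positionally aligned, and the scan for row i starts
# one row after i.
vals = df_y['open'].to_numpy()
dic = {i: i + 1 for i in range(len(df_x))}

result = high_after_low(df_x['high target'].to_numpy(),
                        df_x['low target'].to_numpy(),
                        vals,
                        dic)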

Solution 1 (Vectorized):

The vectorized solution requires pre-processing the input array into a 2d array such that the i-th row contains the array’s values from [i:i+T] (a sketch of one way to build such a frame follows the function below). Then,

def vectorized_high_after_low(df, high_values, low_values):
    """Args:
       df: A pandas DataFrame with one row per original row; column t
           contains the value of row i, t rows ahead.
       high_values: The high targets corresponding to each row.
       low_values: The low targets corresponding to each row.
    """
    # Column label of the first True per row; for an all-False row,
    # idxmax returns the first column label (0).
    higher = df.ge(high_values).idxmax(axis=1)
    lower = df.le(low_values).idxmax(axis=1)

    # Treat label 0 as "never hit" by pushing it past the last column.
    higher[higher == 0] = df.shape[1]
    lower[lower == 0] = df.shape[1]

    high_after_low = higher > lower             # True: high hit after low
    high_after_low[higher == lower] = np.nan    # tie, or neither was hit

    return high_after_low
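
The pre-processing step itself isn’t shown above; a minimal sketch of one way to build the lookahead frame (my addition, assuming a fixed horizon of T rows and NumPy >= 1.20 for sliding_window_view) could be:

import numpy as np
import pandas as pd

def build_lookahead(values, T):
    """Return a DataFrame whose i-th row holds values[i:i+T]."""
    windows = np.lib.stride_tricks.sliding_window_view(values, T)
    # sliding_window_view drops the last T-1 rows (not enough future
    # values); reindexing pads them back with NaN so the row count
    # matches the original series.
    return pd.DataFrame(windows).reindex(range(len(values)))

lookahead = build_lookahead(df_y['open'].to_numpy(), T=60)
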
Answered By: Vanillihoot