Find first occurrence of Price value which is greater than current value in Pandas dataframe using vectorization


Let's take this example Pandas dataframe, which has two columns, ['date'] and ['price']:
['date'] is always ascending
['price'] is random

df = pd.DataFrame({
    'date': pd.date_range('2019-01-01', periods=10).strftime('%m/%d/%Y'),
    'price': [10, 2, 5, 4, 12, 8, 9, 19, 12, 3]
})

The goal is to add two more columns:
['next_date'] contains the date of the first later row whose price is greater than the current price
['next_price'] contains the price of that same row

Like this:

         date  price   next_date next_price
0  01/01/2019     10  01/05/2019         12
1  01/02/2019      2  01/03/2019          5
2  01/03/2019      5  01/05/2019         12
3  01/04/2019      4  01/05/2019         12
4  01/05/2019     12  01/08/2019         19
5  01/06/2019      8  01/07/2019          9
6  01/07/2019      9  01/08/2019         19
7  01/08/2019     19         NaN        NaN
8  01/09/2019     12         NaN        NaN
9  01/10/2019      3         NaN        NaN

I’ve tested some solutions that do what I want, but with very poor performance.
The real df has over a million rows.

These are my test solutions:

using Pandasql

result = sqldf("SELECT l.date, l.price, min(r.date) as next_date from df as l left join df as r on (r.date > l.date and r.price > l.price) group by l.date, l.price order by l.date")
result=pd.merge(result ,df, left_on='next_date', right_on='date', suffixes=('', '_next'), how='left')

using Pandas to SQLite

df.to_sql('df', conn, index=False)
qry = "SELECT l.date, l.price, min(r.date) as next_date from df as l left join df as r on (r.date > l.date and r.price > l.price) group by l.date, l.price order by l.date"
result = pd.read_sql_query(qry, conn)
result=pd.merge(result ,df, left_on='next_date', right_on='date', suffixes=('', '_next'), how='left')

using Apply

def find_next_price(row):
    # Later rows whose price exceeds this row's price
    mask = (df['price'] > row['price']) & (df['date'] > row['date'])
    matches = df[mask]
    if len(matches):
        return matches['date'].iloc[0], matches['price'].iloc[0]
    return np.nan, np.nan

df[['next_date', 'next_price']] = list(df.apply(find_next_price, axis=1))

Some of these solutions start to fail on a 50,000-row df,
while I need to perform this task on a 1,000,000-row df.

There is a very similar question, but its answers also perform poorly.

Asked By: Glory Sayf



Since you need to perform this task on a large number of rows (1M+), a traditional NumPy approach might not be feasible, especially when you have a limited amount of memory. Here I’m presenting a functional approach using basic algorithmic computation; you can compile this function with numba's just-in-time compiler to achieve C-like speeds:

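import numba
import numpy as np

@numba.njit
def argmax(price: np.ndarray):
    # For each position, yield the index of the first later price that is greater
    for i in range(len(price)):
        idx = -1
        for j in range(i + 1, len(price)):
            if price[i] < price[j]:
                idx = j
                break  # stop at the first greater price

        yield idx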

i = np.array(list(argmax(df['price'].values)))
m = i != -1 # index is -1 if there's no next greater price

df.loc[m, 'next_date'] = df['date'].values[i[m]]
df.loc[m, 'next_price'] = df['price'].values[i[m]]


         date  price   next_date  next_price
0  01/01/2019     10  01/05/2019        12.0
1  01/02/2019      2  01/03/2019         5.0
2  01/03/2019      5  01/05/2019        12.0
3  01/04/2019      4  01/05/2019        12.0
4  01/05/2019     12  01/08/2019        19.0
5  01/06/2019      8  01/07/2019         9.0
6  01/07/2019      9  01/08/2019        19.0
7  01/08/2019     19         NaN         NaN
8  01/09/2019     12         NaN         NaN
9  01/10/2019      3         NaN         NaN
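
To make the intermediate values concrete, here is a plain-Python sketch of the first-greater scan on the sample prices, without numba or pandas. The helper name `next_greater_index` is hypothetical and only for illustration; it returns the index array that `i` is expected to hold, and `m` marks the rows that have a match.

```python
def next_greater_index(prices):
    """For each position, return the index of the first later price that is
    strictly greater, or -1 when there is none (hypothetical helper)."""
    result = []
    n = len(prices)
    for i in range(n):
        idx = -1
        for j in range(i + 1, n):
            if prices[j] > prices[i]:
                idx = j
                break  # stop at the first greater price
        result.append(idx)
    return result


prices = [10, 2, 5, 4, 12, 8, 9, 19, 12, 3]
i = next_greater_index(prices)
m = [k != -1 for k in i]  # False where no later price is greater

print(i)  # [4, 2, 4, 4, 7, 6, 7, -1, -1, -1]
```

Rows 7, 8 and 9 get -1, which is why their next_date and next_price stay NaN in the output.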

PS: The solution was tested on 1M+ rows.
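
One caveat: the nested scan can degrade to quadratic time when prices fall for long stretches, because every row then has to look far ahead. If that turns out to matter for your data, a different technique, a monotonic stack, finds the same indices in a single pass. The following is only a sketch in plain Python under that assumption; `next_greater_index_stack` is a hypothetical name, and this is not the function shown above.

```python
def next_greater_index_stack(prices):
    """Single-pass variant: index of the first later strictly greater price,
    or -1 when none exists (hypothetical helper)."""
    result = [-1] * len(prices)
    pending = []  # indices still waiting for a greater price
    for j, p in enumerate(prices):
        # The current price resolves every pending index with a smaller price
        while pending and prices[pending[-1]] < p:
            result[pending.pop()] = j
        pending.append(j)
    return result


sample = [10, 2, 5, 4, 12, 8, 9, 19, 12, 3]
print(next_greater_index_stack(sample))  # [4, 2, 4, 4, 7, 6, 7, -1, -1, -1]
```

Each index is pushed once and popped at most once, so the work grows linearly with the number of rows. The resulting list can be used the same way as `i` above, with -1 meaning there is no next greater price.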

Answered By: Shubham Sharma