How can this for loop be written to process faster in Python?

Question:

I’m not familiar enough with Python to understand how I can make a for loop go faster. Here’s what I’m trying to do.

Let’s say we have the following dataframe of prices.

import pandas as pd

df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})

The goal is to create a new column called updown that labels each row "Up" or "Down", depending on which happens first in the rows that follow: the price reaching at least 2 above that row's price, or at least 2 below it.

df['updown'] = 0
for i in range(df.shape[0]):
    j=0
    while df.price.iloc[i+j] < (df.price.iloc[i] + 2) and df.price.iloc[i+j] > (df.price.iloc[i] - 2):
        j= j+1
    if df.price.iloc[i+j] >= (df.price.iloc[i] + 2):
        df.updown.iloc[i] = "Up"
    if df.price.iloc[i+j] <= (df.price.iloc[i] - 2):
        df.updown.iloc[i] = "Down"

This works, but it is far too slow on millions of rows. Note that I am aware the code throws an error once it gets to the last row, which is fine with me.

Where can I learn how to make something like this run much faster (ideally in seconds, or at least minutes, as opposed to the 10+ hours it takes right now)?

Asked By: batataman

||

Answers:

There are various errors stopping the example code from working, at least for me. Could you please confirm this is what you want the algorithm to do?

import pandas as pd

df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})

df['updown'] = None  # object column, so string labels are not written into a numeric column
count = df.shape[0]
for i in range(count):
    j = 1
    up = df.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if(value := df.price.iloc[pos]) >= up:
            df.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1

print(df)
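
To make the rule easier to check by hand, here is a minimal sketch of the same logic on a plain list, without pandas. The function name `first_move` and the `step` parameter are made up for illustration, and rows that never move by 2 in either direction are left as `None`.

```python
def first_move(prices, step=2):
    """Label each price 'Up' or 'Down' by which threshold a later price reaches first."""
    labels = [None] * len(prices)
    for i, start in enumerate(prices):
        # Scan the later prices in order and stop at the first one that
        # is at least `step` above or at least `step` below the start.
        for later in prices[i + 1:]:
            if later >= start + step:
                labels[i] = "Up"
                break
            if later <= start - step:
                labels[i] = "Down"
                break
    return labels


prices = [98, 99, 101, 99, 97, 100, 100, 98]
print(first_move(prices))
# ['Up', 'Up', 'Down', 'Down', 'Up', 'Down', 'Down', None]
```

The last row has nothing after it, so it stays unlabelled. If that matches what you expect for the sample data, the pandas loop above implements the same rule.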

Answered By: John M.

After trying a range of examples, I found that the second method in the following code is approximately 75x faster on the example dataset:

import time
from random import randint

import numpy as np
import pandas as pd

data = [randint(90, 120) for i in range(10000)]

df1 = pd.DataFrame({'price': data})

t0 = time.time()
df1['updown'] = None  # object column, so string labels are not written into a float column
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if(value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1

t1 = time.time()

print(f'Method 1: {t1 - t0}')
res1 = df1.head()

df2 = pd.DataFrame({'price': data})
t2 = time.time()

count = len(df2)
df2['updown'] = None  # object column, so string labels are not written into a float column
up = df2.price + 2
down = df2.price - 2

# increase the shift distance until updown is set for all rows
# or there is insufficient data to change the remaining rows
i = 1
while i < count:
    isna = df2.updown.isna()
    # after the first pass, stop once no row outside the final i - 1 rows is unset
    if i > 1 and not isna[:-(i - 1)].any():
        break
    shift = df2.price.shift(-i)
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1

t3 = time.time()
print(f'Method 2: {t3 - t2}')

s1 = df1.updown
s2 = df2.updown

match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()

print(f'Series match: {match}')

The main reason for the speed improvement is that, instead of iterating over the rows in Python, we operate on whole arrays of data at once, and that work happens in compiled code. Calling into pandas or NumPy (whose core routines are implemented in C) from Python is quick, but every call carries some overhead, and if you make a very large number of calls that overhead quickly becomes the limiting factor.
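
As a rough illustration of that idea, here is a sketch that packages the shifted comparison as a function. It is a variation on method 2 rather than the exact code above: it compares NumPy slices instead of calling `Series.shift`, and the names `classify_updown` and `step` are hypothetical. Each pass of the loop handles every row at once for one look-ahead distance, so the Python-level loop runs once per distance instead of once per row.

```python
import numpy as np
import pandas as pd


def classify_updown(price, step=2):
    """Label rows 'Up'/'Down' using whole-array comparisons (illustrative sketch)."""
    values = price.to_numpy()
    n = len(values)
    labels = np.full(n, None, dtype=object)
    todo = np.ones(n, dtype=bool)  # True while a row is still unlabelled
    for offset in range(1, n):
        head = slice(0, n - offset)  # rows that have a value `offset` rows ahead
        if not todo[head].any():
            break  # every row that could still be labelled already is
        current = values[head]
        future = values[offset:]
        went_up = todo[head] & (future >= current + step)
        went_down = todo[head] & (future <= current - step)
        # labels[head] is a view, so these assignments write into `labels`.
        labels[head][went_up] = "Up"
        labels[head][went_down] = "Down"
        todo[head] &= ~(went_up | went_down)
    return pd.Series(labels, index=price.index, name="updown", dtype=object)


df = pd.DataFrame({"price": [98, 99, 101, 99, 97, 100, 100, 98]})
df["updown"] = classify_updown(df["price"])
print(df)
```

On the question's sample data the first seven rows come out as Up, Up, Down, Down, Up, Down, Down, and the last row stays unlabelled. The number of passes depends on how far ahead the slowest row has to look, which is why the gain depends on the input data.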

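The performance gain depends on the input data, but it grows with the number of rows in the dataframe: the more rows there are, the slower the row-by-row loop becomes relative to the array version. In the table below, times are in seconds and increase is the ratio method1 / method2: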

   iterations     method1   method2     increase
0         100    0.056002  0.018267     3.065689
1        1000    0.209895  0.005000    41.982070
2       10000    2.625701  0.009001   291.727054
3      100000  108.080149  0.042001  2573.260448

Answered By: John M.