Problem with looping over pandas dataframe

Question:

I have a dataframe (df) of 100 rows (will be more eventually) and 2689 columns. I have to apply a function, say fn1(df, a, b), to each row. fn1 itself contains a for loop, which I cannot avoid, so I want to speed up applying fn1 to all rows of df. The a and b arguments of fn1 are other arguments I need in order to compute a value inside fn1. I have read a lot about not using iterrows(), using apply() instead, and vectorization being best of all. However, I cannot see any time difference when I use these. I use them as below. Can anyone please correct the commands so that a speedup actually happens? I provide a smaller case of a df of size 5*2413 here.

My first naive approach:

for row in df.values:
    fn1(row, a, b)  # performing the operation row by row

Second approach:

df['f'] = df.apply(lambda row: fn1(row, a, b), axis=1, raw=True)
# takes ~1.2 s for a df of size 5*2689

Third approach:

df['f']=fn1(df.values,a,b)
#same time

Fourth approach:

vectfunc = np.vectorize(fn1)
df['f'] = vectfunc(df[df.columns[0:]], a, b)
# or
df['f'] = vectfunc(df, a, b)

In this fourth approach, both calls give me the error 'numpy.float64' object is not iterable, because inside fn1 I call list(...) on whatever I get as the first argument before performing the rest of the fn1 operations. With df[df.columns[0:]] or df, np.vectorize passes fn1 only a single element (the first element of row 1) instead of the entire row.

Can anyone please suggest how to speed this up? If you think the data size is too small to draw conclusions, then please at least correct my fourth approach so that I can apply vectorization correctly.

Update: I see that np.vectorize doesn't speed things up per se. In that case, what about pandas vectorization? Can I apply it here, and if so, how? Please see my comment below.

Note: Anyone reading the accepted answer may also read my final comment that says how I ended up speeding up the code.

Asked By: knowledge_seeker


Answers:

You could try parallelizing the computation using the multiprocessing library.

This is how you can do it:

import pandas as pd
import numpy as np
from multiprocessing import Pool

def fn1(row, a, b):
    # Implement your function here; I used a + b for simplicity
    result = a + b
    return result

def fn1_wrapper(args):
    row, a, b = args
    return fn1(row, a, b)

# I could not find the df sample you mentioned in your question, so I created a fake one to test with
df = pd.DataFrame(np.random.rand(100, 2689))
a = 0.5
b = 0.25

if __name__ == '__main__':  # guard is required where the start method is 'spawn' (Windows/macOS)
    with Pool(processes=4) as pool:  # adjust 'processes' to your CPU core count
        results = pool.map(fn1_wrapper, [(row, a, b) for _, row in df.iterrows()])

    # Assign the results to a new column in the DataFrame
    df['final'] = results
Answered By: Tasos

Looping at Python level over a Pandas DataFrame is slow. Full stop. The recommended approach is to use only pandas or numpy functions, because they are implemented in C under the hood and are much faster at processing numeric values.
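For instance, if the loop inside fn1 just combines each value in a row with a and b (a hypothetical stand-in, since the real fn1 isn't shown), the entire per-row Python loop collapses into a single array expression evaluated in C:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2689))
a, b = 0.5, 0.25

# Slow: a Python-level loop over every element of every row
slow = [sum(a * v + b for v in row) for row in df.values]

# Fast: one vectorised expression over the whole 2-D array
fast = (a * df.values + b).sum(axis=1)

df['f'] = fast
```

The exact expression depends on what fn1 really computes; the point is that any loop expressible as element-wise arithmetic plus a reduction can be rewritten this way.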

If this is not an option, the alternative is to use Cython or Numba to compile the function containing the for loop into native code that can then run at C speed. Not as straightforward as one could hope, but it can be very effective.

Answered By: Serge Ballesta