How to apply a function to every row of a dataframe?

Question:

I am new to Python and I am not sure how to solve the following problem.

I have a function:

def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

Say I have the dataframe

df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})

    D   p
0   10  20
1   20  30
2   30  10

ch=0.2
ck=5

Here ch and ck are numeric constants. Now I want to apply the formula to every row of the dataframe and store the result in an extra column ‘Q’. An example (that does not work) would be:

df['Q']= map(lambda p, D: EOQ(D,p,ck,ch),df['p'], df['D']) 

(returns only ‘map’ types)

I will need this type of processing more in my project and I hope to find something that works.

Asked By: Koen

Answers:

The following should work:

import math

def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

ch=0.2
ck=5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df

If all you’re doing is calculating the square root of some result, use np.sqrt instead. It is vectorised and will be significantly faster:

In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))

df
Out[80]:
    D   p          Q
0  10  20   5.000000
1  20  30   5.773503
2  30  10  12.247449
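
Since the question mentions needing this kind of processing repeatedly, one option is to keep the formula in a function but write it with NumPy operations, so that it accepts whole columns. This is only a minimal sketch using the same column names as above; the name eoq_vectorised is made up for illustration:

```python
import numpy as np
import pandas as pd

def eoq_vectorised(D, p, ck, ch):
    # np.sqrt works element-wise, so D and p may be scalars or whole columns
    return np.sqrt((2 * D * ck) / (ch * p))

df = pd.DataFrame({"D": [10, 20, 30], "p": [20, 30, 10]})
ch = 0.2
ck = 5

# Pass the columns themselves instead of looping over rows
df["Q"] = eoq_vectorised(df["D"], df["p"], ck, ch)

# The same function still works for a single pair of values
single = eoq_vectorised(10, 20, ck, ch)
```

The function body is the same formula as in EOQ, with math.sqrt swapped for np.sqrt.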

Timings

For a 30k row df:

In [92]:

import math
ch=0.2
ck=5
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop

You can see that the NumPy method is ~1900x faster.
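
%timeit only exists inside IPython/Jupyter. Outside of it, a comparison like this could be sketched with the standard timeit module. The 30,000-row frame here is built by repeating the three example rows, which is an assumption (the answer does not say how its frame was built), and the absolute numbers will differ by machine and library version:

```python
import math
import timeit

import numpy as np
import pandas as pd

def EOQ(D, p, ck, ch):
    return math.sqrt((2 * D * ck) / (ch * p))

ch = 0.2
ck = 5

# Assumption: repeat the three example rows to get a 30,000-row frame
small = pd.DataFrame({"D": [10, 20, 30], "p": [20, 30, 10]})
df = pd.concat([small] * 10_000, ignore_index=True)

def vectorised():
    return np.sqrt((2 * df["D"] * ck) / (ch * df["p"]))

def row_wise():
    return df.apply(lambda row: EOQ(row["D"], row["p"], ck, ch), axis=1)

# Only a few repetitions, because the row-wise version is slow
t_vec = timeit.timeit(vectorised, number=3) / 3
t_row = timeit.timeit(row_wise, number=3) / 3
print(f"vectorised: {t_vec:.6f} s per call")
print(f"row-wise:   {t_row:.6f} s per call")
```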

Answered By: EdChum

There are a few more ways to apply a function to every row of a DataFrame.

(1) You could modify EOQ a bit so that it accepts a row (a Series object) as its argument and accesses the relevant elements by column name inside the function. You can also pass extra arguments such as ck or ch to the function through apply as keyword arguments:

def EOQ1(row, ck, ch):
    Q = math.sqrt((2*row['D']*ck)/(ch*row['p']))
    return Q

df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
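
As a variation on (1), apply also takes an args tuple, which is forwarded to the function positionally after the row. A small sketch, where the column name Q1_args is made up for illustration:

```python
import math

import pandas as pd

def EOQ1(row, ck, ch):
    return math.sqrt((2 * row["D"] * ck) / (ch * row["p"]))

df = pd.DataFrame({"D": [10, 20, 30], "p": [20, 30, 10]})
ch = 0.2
ck = 5

# args is passed on after the row, so each call is EOQ1(row, ck, ch)
df["Q1_args"] = df.apply(EOQ1, axis=1, args=(ck, ch))
```

Keyword arguments are arguably the safer choice here, since ck and ch are easy to swap by accident when passed by position.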

(2) It turns out that apply is often slower than a list comprehension (in the benchmark below, it’s about 20x slower). To use a list comprehension, you could modify EOQ further so that it accesses elements by position. Then call the function in a loop over the df rows, converted to lists:

def EOQ2(row, ck, ch):
    Q = math.sqrt((2*row[0]*ck)/(ch*row[1]))
    return Q

df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
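
If you would rather not rewrite the function, a list comprehension can also call the original EOQ unchanged by zipping the two columns together. This is a sketch of that variation and was not part of the benchmark; the column name Q2_zip is made up for illustration:

```python
import math

import pandas as pd

def EOQ(D, p, ck, ch):
    return math.sqrt((2 * D * ck) / (ch * p))

df = pd.DataFrame({"D": [10, 20, 30], "p": [20, 30, 10]})
ch = 0.2
ck = 5

# zip pairs up the values of the two columns row by row
df["Q2_zip"] = [EOQ(D, p, ck, ch) for D, p in zip(df["D"], df["p"])]
```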

(3) As it happens, if the goal is to call a function iteratively, map is usually a little faster than a list comprehension (26.9 ms vs 31.3 ms in the benchmark below). So you could convert df into a list of rows, map the function over it, and unpack the result into a list:

df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
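
This also explains why the attempt in the question did not work: in Python 3, map returns a lazy iterator rather than a list, so the result has to be materialised with list(...) or [*...] before it is assigned. A sketch that stays close to the question's version, using functools.partial to fix ck and ch instead of building repeated lists; the names eoq_fixed and Q_map are made up for illustration:

```python
import math
from functools import partial

import pandas as pd

def EOQ(D, p, ck, ch):
    return math.sqrt((2 * D * ck) / (ch * p))

df = pd.DataFrame({"D": [10, 20, 30], "p": [20, 30, 10]})
ch = 0.2
ck = 5

# Fix the constants so that only D and p remain as arguments
eoq_fixed = partial(EOQ, ck=ck, ch=ch)

# map is lazy: this is a map object and nothing has been computed yet
lazy = map(eoq_fixed, df["D"], df["p"])

# list(...) runs the calls and produces the values to assign
df["Q_map"] = list(lazy)
```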

(4) As @EdChum notes, it’s better to use vectorized methods whenever possible instead of applying a function row by row. Pandas offers vectorized methods that rival NumPy’s. In the case of EOQ, for example, instead of math.sqrt you could use pandas’ pow method (in the benchmark below, the pandas vectorized methods are ~20% faster than NumPy):

df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)

Output:

    D   p          Q       Q_np         Q1        Q2a        Q2b       Q_pd
0  10  20   5.000000   5.000000   5.000000   5.000000   5.000000   5.000000
1  20  30   5.773503   5.773503   5.773503   5.773503   5.773503   5.773503
2  30  10  12.247449  12.247449  12.247449  12.247449  12.247449  12.247449

Timings:

df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
df = pd.concat([df]*10000)

>>> %timeit df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
623 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
615 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
31.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
26.9 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df['Q_np'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
1.19 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
966 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answered By: user7864386