How to efficiently apply a function to every row in a dataframe

Question:

Given the following table:

import pandas as pd

df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})

that looks like this:

    code
0   100M
1   60M10N40M
2   5S99M
3   1S25I100M
4   1D1S1I200M

I’d like to convert the code column strings to numbers where M, N, D are each equivalent to (times 1), I is equivalent to (times -1) and S is equivalent to (times 0).

The result should look like this:

     code       Val
0   100M        100     This is (100*1)
1   60M10N40M   110     This is (60*1)+(10*1)+(40*1)
2   5S99M       99      This is (5*0)+(99*1)
3   1S25I100M   75      This is (1*0)+(25*-1)+(100*1)
4   1D1S1I200M  200     This is (1*1)+(1*0)+(1*-1)+(200*1)

I wrote the following function to do this:

import re

def String2Val(String):
    # Generate substrings
    sstrings = re.findall('.[^A-Z]*.', String)

    KeyDict = {'M':'*1','N':'*1','I':'*-1','S':'*0','D':'*1'}

    newlist = []
    for key, value in KeyDict.items():
        for i in sstrings:
            if key in i:
                p = i.replace(key, value)
                lp = eval(p)
                newlist.append(lp)

    OutputVal = sum(newlist)
    return OutputVal

df['Val'] = df.apply(lambda row: String2Val(row['code']), axis = 1)

After applying this function to the table, I realized it’s not efficient and takes forever when applied to large datasets. How can I optimize this process?

Asked By: newbzzs


Answers:

You can append a plus sign to each value in KeyDict, replace the letters in the code column using KeyDict, and finally call pd.eval to evaluate each resulting expression.

KeyDict = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}


df['val'] = (df['code'].replace(KeyDict, regex=True)
             .str.rstrip('+').apply(pd.eval))
# or use a plain Python list comprehension, since Series.apply is not efficient
df['val'] = [pd.eval(val) for val in df['code'].replace(KeyDict, regex=True).str.rstrip('+')]
print(df)

         code  val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200
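
To see what the replacement step produces before pd.eval runs, here is a minimal pure-Python sketch. It assumes the same KeyDict as above; the helper name to_expression is hypothetical, and it uses re.sub on plain strings rather than Series.replace.

```python
import re

# Same mapping as KeyDict above: each letter becomes an arithmetic suffix
KeyDict = {'M': '*1+', 'N': '*1+', 'I': '*-1+', 'S': '*0+', 'D': '*1+'}

def to_expression(code):
    """Turn a code such as '1S25I100M' into an arithmetic expression string."""
    # Replace every known letter by its suffix via a dictionary lookup
    expr = re.sub('[MNISD]', lambda m: KeyDict[m.group()], code)
    # Drop the trailing '+' so the expression is valid
    return expr.rstrip('+')

codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
expressions = [to_expression(c) for c in codes]
print(expressions[3])  # 1*0+25*-1+100*1
```

Each of these strings is what eventually gets evaluated, one row at a time.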
Answered By: Ynjxsjmh

You can try the following solution that uses replace():

import pandas as pd

def String2Val(row):
    # Use replace to find and replace characters according to your KeyDict definition
    val = row.replace('M', '*1+').replace('N', '*1+').replace('I', '*-1+').replace('S', '*0+').replace('D', '*1+')
    # Ensure the last part of the string isn't a +
    if val[-1] == "+":
        # If it is, remove the + from the end
        val = val[:-1]
    # Return the evaluated value
    return eval(val)

df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
# Apply the function to the code column only, which removes the need for lambda and axis=1
df['Val'] = df['code'].apply(String2Val)

df:

         code  Val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200
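
A possible variation, sketched under assumptions: the chained replace() calls can be collapsed into one str.translate pass, and the input can be checked before it reaches eval. The names TABLE, VALID and string_to_val are hypothetical, and the validation pattern assumes every code consists only of digit runs followed by one of M, N, I, S, D.

```python
import re

# One translation table instead of five chained replace() calls
TABLE = str.maketrans({'M': '*1+', 'N': '*1+', 'I': '*-1+', 'S': '*0+', 'D': '*1+'})
# Accept only strings made of (digits + one known letter) groups
VALID = re.compile(r'(?:\d+[MNISD])+')

def string_to_val(code):
    # Refuse anything that is not a plain code, since eval runs arbitrary Python
    if not VALID.fullmatch(code):
        raise ValueError(f'unexpected code: {code!r}')
    # Translate letters to arithmetic suffixes and drop the trailing '+'
    expr = code.translate(TABLE).rstrip('+')
    return eval(expr)

codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
values = [string_to_val(c) for c in codes]
print(values)  # [100, 110, 99, 75, 200]
```

It would be applied the same way, with df['code'].apply(string_to_val). Numbers written with leading zeros (for example '05M') would still fail inside eval, so this is only a sketch for inputs shaped like the ones in the question.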
Answered By: Marcelo Paco

Pandas string methods are not well optimized (although that seems to no longer be true as of pandas 2.0), so if you’re after performance, it’s better to loop over the values and use Python’s built-in string methods, which are implemented in C. A straightforward loop over each string seems to give the best performance.

def evaluater(s):
    total, curr = 0, ''
    for e in s:
        # if a number concatenate to the previous number
        if e.isdigit():
            curr += e
        # if a string, look up its value in KeyDict
        # and multiply the currently collected number by it
        # and add to the total
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
df['val'] = df['code'].map(evaluater)

Performance:

KeyDict1 = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']*1000})

%timeit df.assign(val=df['code'].map(evaluater))
# 12.2 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.assign(val=df['code'].apply(String2Val))    # @Marcelo Paco
# 61.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(val=df['code'].replace(KeyDict1, regex=True).str.rstrip('+').apply(pd.eval))   # @Ynjxsjmh
# 4.86 s ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

N.B. Your implementation is already similar, but the outer loop (for key, value in KeyDict.items()) is unnecessary: since KeyDict is a dictionary, use it as a lookup table instead of looping over it. Also, .apply(axis=1) is a really slow way to loop when only a single column is relevant. Select that column and call apply() on it.
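
The lookup-table idea can also be sketched with a regular expression that splits each string into (number, letter) pairs. This is only an illustration, assuming every code is a sequence of digit runs each followed by one uppercase letter found in KeyDict; TOKEN and evaluate_tokens are hypothetical names.

```python
import re

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
# Each token is a run of digits followed by a single uppercase letter
TOKEN = re.compile(r'(\d+)([A-Z])')

def evaluate_tokens(s):
    # Multiply each number by the weight of its letter and add everything up
    return sum(int(num) * KeyDict[letter] for num, letter in TOKEN.findall(s))

codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
print([evaluate_tokens(c) for c in codes])  # [100, 110, 99, 75, 200]
```

It plugs into the dataframe the same way as evaluater, for example df['code'].map(evaluate_tokens). No timing is claimed for it here.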

Answered By: cottontail

Another possible solution replaces each letter with its corresponding multiplicative factor and then evaluates the resulting strings with eval:

df['val'] = (df['code'].str.replace('M|N|D', '*1+', regex=True)
             .str.replace('I', '*(-1)+', regex=True)
             .str.replace('S', '*0+', regex=True)
             .str.replace(r'\+$', '', regex=True).map(eval))

Output:

         code  val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200
Answered By: PaulS

Thanks all. In case this may be helpful to others, I checked the performance of everyone’s solution on data of different sizes and obtained the following results. Very educational, great tips to be applied when working with large datasets.

[Image: performance comparison of the solutions on data of different sizes]
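
For anyone who wants to repeat such a comparison, here is a minimal timing sketch using the standard timeit module. It is not the code behind the results above: the two functions are simplified plain-Python stand-ins for the loop-with-lookup and replace-then-eval approaches, and the sizes and repetition count are arbitrary assumptions.

```python
import timeit

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
EvalDict = {'M': '*1+', 'N': '*1+', 'I': '*-1+', 'S': '*0+', 'D': '*1+'}

def loop_lookup(s):
    # Character loop with a dictionary lookup
    total, curr = 0, ''
    for ch in s:
        if ch.isdigit():
            curr += ch
        else:
            total += int(curr) * KeyDict[ch]
            curr = ''
    return total

def replace_eval(s):
    # Build an arithmetic expression string and evaluate it
    for letter, suffix in EvalDict.items():
        s = s.replace(letter, suffix)
    return eval(s.rstrip('+'))

base = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
runs = 3
timings = {}
for repeat in (10, 100, 1000):
    codes = base * repeat
    for fn in (loop_lookup, replace_eval):
        seconds = timeit.timeit(lambda: [fn(c) for c in codes], number=runs)
        timings[(fn.__name__, len(codes))] = seconds / runs
        print(f'{fn.__name__:>12} n={len(codes):>5}: {seconds / runs:.6f} s per run')
```

Actual numbers depend on the machine and Python version, so none are shown here.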

Answered By: newbzzs