How to efficiently apply a function to every row in a dataframe
Question:
Given the following table:
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
that looks like this:
code
0 100M
1 60M10N40M
2 5S99M
3 1S25I100M
4 1D1S1I200M
I’d like to convert the code column strings to numbers, where M, N and D each mean "times 1", I means "times -1" and S means "times 0".
The result should look like this:
code Val
0 100M 100 This is (100*1)
1 60M10N40M 110 This is (60*1)+(10*1)+(40*1)
2 5S99M 99 This is (5*0)+(99*1)
3 1S25I100M 75 This is (1*0)+(25*-1)+(100*1)
4 1D1S1I200M 200 This is (1*1)+(1*0)+(1*-1)+(200*1)
I wrote the following function to do this:
import re

def String2Val(String):
    # Generate substrings: a run of digits followed by its letter
    sstrings = re.findall('.[^A-Z]*.', String)
    KeyDict = {'M':'*1','N':'*1','I':'*-1','S':'*0','D':'*1'}
    newlist = []
    for key, value in KeyDict.items():
        for i in sstrings:
            if key in i:
                # Turn e.g. '25I' into '25*-1' and evaluate it
                p = i.replace(key, value)
                lp = eval(p)
                newlist.append(lp)
    OutputVal = sum(newlist)
    return OutputVal

df['Val'] = df.apply(lambda row: String2Val(row['code']), axis = 1)
After applying this function to the table, I realized it’s not efficient and takes forever when applied to large datasets. How can I optimize this process?
Answers:
You can append the addition symbol to each value in KeyDict, replace the letters in the code column using KeyDict, and finally call pd.eval to do the calculation.
KeyDict = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}
df['val'] = (df['code'].replace(KeyDict, regex=True)
             .str.rstrip('+').apply(pd.eval))
# Or use a native Python loop, since Series.apply is not efficient
df['val'] = [pd.eval(val) for val in df['code'].replace(KeyDict, regex=True).str.rstrip('+')]
print(df)
code val
0 100M 100
1 60M10N40M 110
2 5S99M 99
3 1S25I100M 75
4 1D1S1I200M 200
You can try the following solution that uses replace():
import pandas as pd

def String2Val(row):
    # Use replace to find and replace characters according to your KeyDict definition
    val = row.replace('M', '*1+').replace('N', '*1+').replace('I', '*-1+').replace('S', '*0+').replace('D', '*1+')
    # Ensure the last part of the string isn't a +
    if val[-1] == "+":
        # If it is, remove the + from the end
        val = val[:-1]
    # Return the evaluated value
    return eval(val)

df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
# Use apply only on the code column, which removes the need for lambda and axis=1
df['Val'] = df['code'].apply(String2Val)
Output of df:
code Val
0 100M 100
1 60M10N40M 110
2 5S99M 99
3 1S25I100M 75
4 1D1S1I200M 200
pandas string methods are not optimized (although that seems to no longer be true for pandas 2.0), so if you’re after performance, it’s better to call plain Python string methods (which are implemented in C) in a loop. A straightforward loop over the characters of each string seems to give the best performance here.
def evaluater(s):
    total, curr = 0, ''
    for e in s:
        # if a digit, concatenate it to the number collected so far
        if e.isdigit():
            curr += e
        # if a letter, look up its value in KeyDict,
        # multiply the currently collected number by it
        # and add the result to the total
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
df['val'] = df['code'].map(evaluater)
Performance:
KeyDict1 = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']*1000})
%timeit df.assign(val=df['code'].map(evaluater))
# 12.2 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.assign(val=df['code'].apply(String2Val)) # @Marcelo Paco
# 61.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(val=df['code'].replace(KeyDict1, regex=True).str.rstrip('+').apply(pd.eval)) # @Ynjxsjmh
# 4.86 s ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
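The %timeit lines above are IPython magics. Outside IPython, a rough equivalent using the standard-library timeit module might look like the sketch below. It times evaluater on a plain list rather than through df.assign, so the numbers will not match the ones above, and no timings are claimed here; the names codes and elapsed are made up for illustration.

```python
import timeit

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}

def evaluater(s):
    # Same character-by-character loop as in the answer above
    total, curr = 0, ''
    for e in s:
        if e.isdigit():
            curr += e
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

# Same 5000-row workload as the benchmark, as a plain Python list
codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M'] * 1000

# Time 10 full passes over the list; the result varies by machine
elapsed = timeit.timeit(lambda: [evaluater(c) for c in codes], number=10)
print(f"{elapsed / 10 * 1000:.2f} ms per pass")
```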
N.B. You already implement something similar, but the outer loop (for key, value in KeyDict.items()) is unnecessary; since KeyDict is a dictionary, use it as a lookup table; don’t loop. Also, .apply(axis=1) is a really bad way to loop when only a single column is relevant. Select that column and call apply().
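As a sketch of the lookup-table idea applied to the question's regex approach: capture each number together with its letter and index the dictionary directly, with no eval and no loop over the dictionary. The names WEIGHTS, TOKEN and string_to_val are hypothetical, and the pattern assumes every code is a sequence of digits followed by a single uppercase letter.

```python
import re

# Same weights as KeyDict in the answer above
WEIGHTS = {'M': 1, 'N': 1, 'D': 1, 'I': -1, 'S': 0}
# Compile once: one or more digits followed by a single uppercase letter
TOKEN = re.compile(r'(\d+)([A-Z])')

def string_to_val(code):
    # Each match is a (number, letter) pair; look the letter up directly
    return sum(int(num) * WEIGHTS[letter] for num, letter in TOKEN.findall(code))

codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
print([string_to_val(c) for c in codes])  # [100, 110, 99, 75, 200]
```

With the dataframe from the question this would be used as df['Val'] = df['code'].map(string_to_val). It has not been benchmarked against the loop above.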
Another possible solution replaces each letter with its multiplicative factor and then evaluates the resulting strings with eval:
df['val'] = (df['code'].str.replace('M|N|D', '*1+', regex=True)
             .str.replace('I', '*(-1)+', regex=True)
             .str.replace('S', '*0+', regex=True)
             # Escape the +, otherwise the pattern is an invalid regex
             .str.replace(r'\+$', '', regex=True).map(eval))
Output:
code val
0 100M 100
1 60M10N40M 110
2 5S99M 99
3 1S25I100M 75
4 1D1S1I200M 200
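For comparison, here is a sketch that avoids eval altogether by splitting each code into (number, letter) pairs with Series.str.extractall and summing per row. The group names num and op and the weights dictionary are assumptions made for illustration, and this has not been timed against the other answers.

```python
import pandas as pd

df = pd.DataFrame({'code': ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']})
weights = {'M': 1, 'N': 1, 'D': 1, 'I': -1, 'S': 0}

# One row per (number, letter) match; the index is (original row, match number)
parts = df['code'].str.extractall(r'(?P<num>\d+)(?P<op>[A-Z])')

# Multiply each number by the weight of its letter
terms = parts['num'].astype(int) * parts['op'].map(weights)

# Sum the terms belonging to each original row
df['val'] = terms.groupby(level=0).sum()
print(df)
```

Rows whose code contains no match would end up as NaN under this approach, so it assumes every code is well formed.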