pandas apply and applymap functions take a long time to run on a large dataset
Question:
I apply two functions to a DataFrame:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: the DataFrame has almost 700,000 rows, and this takes a long time to run.
How can I reduce the running time?
Sample data:
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
Expected output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line res = df.apply(lambda x: pd.Series(list(x))) takes the items of each list and puts them one by one into separate columns, as shown above. The real data will have almost 38 columns.
Answers:
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print(df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
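As a minimal sketch of the constructor approach on the sample data from the question: the column names B to E and the join back onto the original frame are assumptions taken from the expected output, not part of the answer itself.

```python
import pandas as pd

# Sample data from the question; each cell of column A holds a list.
df = pd.DataFrame({
    "A": [
        ["1", "4", "3", "c"],
        ["t", "g", "h", "j"],
        ["d", "g", "e", "w"],
        ["f", "i", "j", "h"],
        ["m", "z", "s", "e"],
        ["q", "f", "d", "s"],
    ]
})

# Build the new columns in one call instead of creating a Series per row.
# Passing index=df.index keeps the rows aligned if the index is not 0..n-1.
expanded = pd.DataFrame(df["A"].tolist(), index=df.index)

# Assumed column names, matching the expected output in the question.
expanded.columns = ["B", "C", "D", "E"]

# Attach the new columns to the original frame.
out = df.join(expanded)
print(out)
```

With 38 items per list, the column names would have to be generated rather than typed out, or the default integer names can simply be kept.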
For the second step, if the columns do not mix numeric and string values, strip the quotes with the vectorized .str accessor on the object columns only:
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
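If the columns do mix numbers and strings, one alternative (not from the answer above, just a sketch) is to strip the quotes in plain Python while building the rows, before the DataFrame is constructed. The sample values and the helper name strip_quotes are hypothetical.

```python
import pandas as pd

# Hypothetical sample: lists that mix integers with quoted strings.
df = pd.DataFrame({
    "A": [
        [1, '"4"', 3, '"c"'],
        ['"t"', '"g"', '"h"', '"j"'],
    ]
})

def strip_quotes(value):
    """Remove surrounding double quotes from strings; leave other types alone."""
    return value.strip('"') if isinstance(value, str) else value

# Clean every item once in a list comprehension, then build the frame in one call.
rows = [[strip_quotes(v) for v in row] for row in df["A"]]
res = pd.DataFrame(rows, index=df.index)
print(res)
```

This keeps the isinstance check from the original applymap call, but runs it over plain lists rather than over DataFrame cells.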
A late reply perhaps, but for people like me who stumble on this topic with the same question, it may still be worth adding what I found.
I used the swifter library. In my case, an apply call on a pandas DataFrame ran at least twice as fast, and it also consumed less RAM:
import pandas as pd
import swifter
# then add .swifter between df and .apply, like so:
res = df.swifter.apply(lambda x:pd.Series(list(x)))
That's all. It worked really well for me, and it also shows a progress bar in the terminal, which is helpful.
I got the solution from: https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66
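To check how much any of these changes help on your own data, a rough timing sketch like the following may be useful. The row count and the repeated sample row are hypothetical, and df["A"].apply(pd.Series) is used here as a row-wise stand-in for the apply call in the question; actual timings will depend on the machine and the data.

```python
import time

import pandas as pd

# Hypothetical test data: the same 4-item list repeated many times.
n_rows = 2_000
df = pd.DataFrame({"A": [["1", "4", "3", "c"] for _ in range(n_rows)]})

# Row-wise apply: builds one Series per row, which is the slow part.
start = time.perf_counter()
via_apply = df["A"].apply(pd.Series)
t_apply = time.perf_counter() - start

# Constructor approach: converts all lists in a single call.
start = time.perf_counter()
via_constructor = pd.DataFrame(df["A"].tolist(), index=df.index)
t_constructor = time.perf_counter() - start

print(f"apply(pd.Series): {t_apply:.4f} s")
print(f"DataFrame(tolist): {t_constructor:.4f} s")
```

Both results should hold the same values, so the comparison measures only how the columns are built.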