Why is this apply() custom function slower in Polars than in Pandas?

Question:

I ran the following in a Jupyter Notebook and was disappointed that similar Pandas code is faster. Hoping someone can show a smarter approach in Polars.

POLARS VERSION

import re
import polars as pl

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"  # right single quotation mark
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)
    sentence = re.sub(r" +", " ", sentence)
    return sentence.strip()

df = df.with_columns(pl.col("text").apply(lambda x: cleanse_text(x)).keep_name())

PANDAS VERSION

import re

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"  # right single quotation mark
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)
    sentence = re.sub(r" +", " ", sentence)
    return sentence.strip()

df["text"] = df["text"].apply(lambda x: cleanse_text(x))

The above Pandas version was 10% faster than the Polars version when I ran this on a dataframe with 750,000 rows of text.

Asked By: Biosopher

||

Answers:

Instead of combining Series.apply with re.sub, you can chain two calls to Series.str.replace and finish with Series.str.strip. This is generally faster (see the end of this answer for why), and especially so in polars.

Pandas version

import pandas as pd
t = "'Hello  World\u2019 "
df = pd.DataFrame({'text': [t]*750000})

# Same three steps as cleanse_text, expressed as column-wise string methods
df['text'] = (df['text']
              .str.replace('\u2019',"'", regex=True)
              .str.replace(' +',' ', regex=True)
              .str.strip())

df.head()

            text
0  'Hello World'
1  'Hello World'
2  'Hello World'
3  'Hello World'
4  'Hello World'

Polars version

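import polars as pl
t = "'Hello  World\u2019 "
df_pl = pl.DataFrame({'text': [t]*750000})

# str.replace_all substitutes every match, like re.sub;
# Polars' str.replace only substitutes the first match.
# In recent Polars releases str.strip is named str.strip_chars.
df_pl = (df_pl
         .with_columns(pl.col('text')
                       .str.replace_all('\u2019',"'")
                       .str.replace_all(' +',' ')
                       .str.strip()))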

df_pl.head()

┌───────────────┐
│ text          │
│ ---           │
│ str           │
╞═══════════════╡
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
└───────────────┘

Performance comparison

Results of a timeit test for each method (dfs checked for equality); perc is each runtime relative to pandas_new:

       method  timeit (s)      perc
0  pandas_new    1.092429  1.000000
1  pandas_old    1.553892  1.422419
2  polars_new    0.151107  0.138322
3  polars_old    1.851840  1.695158

As you can see, the new method is faster than the original one for both pandas and polars, and the new polars method is the clear winner: it takes only 13.8% of the time of the new pandas method, i.e. it is roughly 7 times faster.

So, why is Series.str.replace (or str.strip) so much faster than Series.apply? The former performs an operation on the entire Series (i.e. a "column") at once ("vectorization"); in polars this work runs in compiled Rust code. The latter calls a Python function separately for each element in the Series: lambda x: cleanse_text(x) means applying a UDF (user-defined function) to the 1st element in the column, then the 2nd, and so on. On larger datasets this makes a huge difference. Cf. also the documentation for pl.DataFrame.apply.
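To make the comparison concrete, here is a minimal pandas-only sketch of how the two approaches could be checked for equality and timed. It is not the benchmark that produced the table above. The helper names per_element and column_wise are made up for illustration, the frame is much smaller than 750,000 rows, and it assumes that u2019 in the snippets stands for the right single quotation mark (U+2019). Timings will vary by machine and library version.

```python
import re
import timeit

import pandas as pd


def cleanse_text(sentence):
    # Per-element version, as in the question
    sentence = re.sub("\u2019", "'", sentence)
    sentence = re.sub(r" +", " ", sentence)
    return sentence.strip()


# Smaller than the 750,000 rows in the question so the sketch runs quickly
df = pd.DataFrame({"text": ["'Hello  World\u2019 "] * 10_000})


def per_element():
    # Calls the Python function once per row
    return df["text"].apply(cleanse_text)


def column_wise():
    # Expresses the same steps as Series.str methods
    return (df["text"]
            .str.replace("\u2019", "'", regex=True)
            .str.replace(" +", " ", regex=True)
            .str.strip())


# Both approaches should give identical values before timing them
assert per_element().tolist() == column_wise().tolist()

for fn in (per_element, column_wise):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Checking equality first matters: a faster method is only a fair replacement if it produces the same column.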

Answered By: ouroboros1