polars slower than numpy?

Question:

I was thinking about using polars in place of numpy in a parsing problem where I turn a structured text file into a character table and operate on different columns. However, it seems that polars is about 5 times slower than numpy in most operations I’m performing. I was wondering why that’s the case and whether I’m doing something wrong given that polars is supposed to be faster.

Example:

import requests
import numpy as np
import polars as pl

# Download the text file
text = requests.get("https://files.rcsb.org/download/3w32.pdb").text

# Turn it into a 2D array of characters
char_tab_np = np.array(text.splitlines()).view(dtype=(str, 1)).reshape(-1, 80)

# Create a polars DataFrame from the numpy array
char_tab_pl = pl.DataFrame(char_tab_np)

# Sort by first column with numpy
char_tab_np[np.argsort(char_tab_np[:, 0])]

# Sort by first column with polars
char_tab_pl.sort(by="column_0")

Using %%timeit in Jupyter, the numpy sort takes about 320 microseconds, whereas the polars sort takes about 1.3 milliseconds, i.e. about four times slower.
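
The question doesn't show the benchmark cells themselves, so this is an assumed setup along these lines:

%timeit char_tab_np[np.argsort(char_tab_np[:, 0])]  # numpy: ~320 µs
%timeit char_tab_pl.sort(by="column_0")             # polars: ~1.3 ms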

I also tried char_tab_pl.lazy().sort(by="column_0").collect(), but it had no effect on the duration.

Another example (take all rows where the first column equals ‘A’):

# with numpy
%timeit char_tab_np[char_tab_np[:, 0] == "A"]

# with polars
%timeit char_tab_pl.filter(pl.col("column_0") == "A")

Again, numpy takes 226 microseconds, whereas polars takes 673 microseconds, about three times slower.

Update

Based on the comments, I tried two other things:

1. Making the file 1000 times larger to see whether polars performs better on larger data.

Results: numpy was still faster, by about a factor of 1.6 (1.3 ms vs. 2.1 ms). In addition, creating the character array took numpy about 2 seconds, whereas polars needed about 2 minutes to create the DataFrame, i.e. about 60 times slower.

To reproduce, just add text *= 1000 before creating the numpy array in the code above.
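
Concretely, the modified setup is:

text = requests.get("https://files.rcsb.org/download/3w32.pdb").text
text *= 1000  # inflate the input 1000x before building the arrays
char_tab_np = np.array(text.splitlines()).view(dtype=(str, 1)).reshape(-1, 80)
char_tab_pl = pl.DataFrame(char_tab_np)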

2. Casting to integer.

For the original (smaller) file, casting to int sped up the process for both numpy and polars. The filtering in numpy was still about four times faster than polars (30 microseconds vs. 120 microseconds), whereas the sorting times became more similar (150 microseconds for numpy vs. 200 for polars). One way this cast might be done is sketched after this update.

However, for the large file, polars was marginally faster than numpy, but the huge instantiation time makes it worthwhile only if the DataFrame is to be queried thousands of times.
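
The question doesn't show the exact cast, so this is a sketch of one way it might be done: a numpy array of single characters (dtype '<U1') can be viewed as uint32 to get the Unicode code points without copying, and the polars DataFrame can then be built from the integer array. The int_tab_* names are hypothetical.

# View each 4-byte UCS-4 character as its integer code point (no copy)
int_tab_np = char_tab_np.view(np.uint32)

# Build the polars DataFrame from the integer array
int_tab_pl = pl.DataFrame(int_tab_np)

# Integer versions of the benchmarks
%timeit int_tab_np[int_tab_np[:, 0] == ord("A")]           # filter with numpy
%timeit int_tab_pl.filter(pl.col("column_0") == ord("A"))  # filter with polars
%timeit int_tab_np[np.argsort(int_tab_np[:, 0])]           # sort with numpy
%timeit int_tab_pl.sort(by="column_0")                     # sort with polars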

Asked By: Qunatized

Answers:

Polars does extra work when filtering string data, and that extra work is not worth it in this case. Polars uses Arrow large-utf8 buffers for its string data. This makes filtering more expensive than filtering python strings/chars (e.g. pointers or u8 bytes).

Sometimes it is worth it, sometimes not. If you have homogeneous data, numpy is a better fit than polars. If you have heterogeneous data, polars will likely be faster, especially if you consider your whole query instead of these micro-benchmarks (see the sketch below).
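
To illustrate the whole-query point, a minimal sketch (the chained operations here are hypothetical, reusing the question's column names): in lazy mode, polars can plan the filter and sort as one pipeline instead of materializing each intermediate result, which is where it tends to pull ahead of chained numpy operations.

# One lazy query: the optimizer sees the whole pipeline at once
result = (
    char_tab_pl.lazy()
    .filter(pl.col("column_0") == "A")  # keep rows whose first character is "A"
    .sort(by="column_1")                # sort the remaining rows by the second column
    .collect()                          # execute the optimized plan
)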

Answered By: ritchie46