How to imitate pandas' index-based querying in Polars?

Question:

Any idea how I can imitate the pandas code below using Polars? Polars doesn’t have indexes like pandas, so I couldn’t figure out what to do.

import pandas as pd

df = pd.DataFrame(data=[[21, 123], [132, 412], [23, 43]], columns=['c1', 'c2']).set_index("c1")
print(df.loc[[23, 132]])

and it prints

c1 c2
23 43
132 412

The only Polars equivalent I could figure out is

import polars as pl

# orient="row" makes explicit that each inner list is one row
df = pl.DataFrame(data=[[21, 123], [132, 412], [23, 43]], schema=['c1', 'c2'], orient="row")
print(df.filter(pl.col("c1").is_in([23, 132])))

but it prints

c1 c2
132 412
23 43

which is okay, but the rows are not in the order I gave. I passed [23, 132] and want the output rows in that same order, as in the pandas output.

I could use a sort() afterwards, yes, but the original data I use this on has around 30 million rows, so I’m looking for something that’s as fast as possible.

Asked By: RacGreen1

Answers:

I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)
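
As a rough intuition for why the left side's order carries over, here is a pure-Python sketch of a hash-style lookup. It is not how Polars implements joins internally; the function name `left_join_lookup` and the assumption that keys are unique are mine, purely for illustration.

```python
def left_join_lookup(keys, rows, key_index=0):
    """Return one row per requested key, in the order of `keys`.

    Illustrative only: assumes each key appears at most once in `rows`.
    """
    # Build a hash table from key value to row (one pass over the big table).
    table = {row[key_index]: row for row in rows}
    # Probe the table in the order of the requested keys;
    # a key with no match yields None, like a null row in a left join.
    return [table.get(key) for key in keys]


rows = [(21, 123), (132, 412), (23, 43)]
ordered = left_join_lookup([23, 132], rows)
print(ordered)
```

Because the output is produced by walking the requested keys, its order follows the key list rather than the order of the large table.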

For example, let’s start with this shuffled DataFrame.

import polars as pl

nbr_rows = 30_000_000

df = pl.DataFrame({
    'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
    'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
})
df
shape: (30000000, 2)
┌──────────┬──────────┐
│ c1       ┆ c2       │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 4052015  ┆ 20642741 │
│ 7787054  ┆ 17007051 │
│ 20246150 ┆ 19445431 │
│ 1309992  ┆ 6495751  │
│ ...      ┆ ...      │
│ 10371090 ┆ 4791782  │
│ 26281644 ┆ 12350777 │
│ 6740626  ┆ 24888572 │
│ 22573405 ┆ 14885989 │
└──────────┴──────────┘

And these index values:

nbr_index_values = 10_000
s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
s1
shape: (10000,)
Series: 'c1' [i64]
[
        1754
        6716
        3485
        7058
        7216
        1040
        1832
        3921
        1639
        6734
        5560
        7596
        ...
        4243
        4455
        894
        7806
        9291
        1883
        9947
        3309
        2030
        7731
        4706
        8528
        8426
]

We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)

import time

start = time.perf_counter()
df2 = (
    s1.to_frame()
    .join(
        df,
        on='c1',
        how='left'
    )
)
print(time.perf_counter() - start)

df2
0.8427023889998964
shape: (10000, 2)
┌──────┬──────────┐
│ c1   ┆ c2       │
│ ---  ┆ ---      │
│ i64  ┆ i64      │
╞══════╪══════════╡
│ 1754 ┆ 15734441 │
│ 6716 ┆ 20631535 │
│ 3485 ┆ 20199121 │
│ 7058 ┆ 15881128 │
│ ...  ┆ ...      │
│ 7731 ┆ 19420197 │
│ 4706 ┆ 16918008 │
│ 8528 ┆ 5278904  │
│ 8426 ┆ 18927935 │
└──────┴──────────┘

Notice how the rows are in the same order as the index values. We can verify this:

s1.series_equal(df2.get_column('c1'), strict=True)
True

And the performance is quite good. On my 32-core system, the join against the 30-million-row DataFrame takes less than a second (about 0.84 seconds in the run above).