Repeat rows in a Polars DataFrame based on column value

Question:

I would like to expand the following Polars DataFrame by repeating each row according to the value in its Quantity column.

Original DataFrame:

Fruit Quantity
Apple 2
Banana 3

Expected Output:

Fruit Quantity
Apple 1
Apple 1
Banana 1
Banana 1
Banana 1

Here is a very similar question but using Pandas instead of Polars: Repeat rows in a pandas DataFrame based on column value

The Polars repeat function does not seem to offer the same functionality as its Pandas counterpart: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.repeat.html

Asked By: NFern

Answers:

You were close. What you were looking for was the repeat_by expression.

First, some data. I’m going to add an ID column, just to show how to apply the repeat_by expression to several columns at once (every column except Quantity).

import polars as pl

df = (
    pl.DataFrame({
        'ID' : [100, 200],
        'Fruit': ["Apple", "Banana"],
        'Quantity': [2, 3],
    })
)
df
shape: (2, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i64      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 2        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana ┆ 3        │
└─────┴────────┴──────────┘

The Algorithm

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity').explode()
    )
    .with_columns(
        pl.lit(1).alias('Quantity')
    )
)
shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i32      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 1        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 100 ┆ Apple  ┆ 1        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana ┆ 1        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana ┆ 1        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana ┆ 1        │
└─────┴────────┴──────────┘

How it works

The repeat_by expression repeats each value in a column as many times as the corresponding value in another column or expression. In this case, we want to repeat by the value in Quantity.

We’ll also use the exclude expression to apply repeat_by to all columns except Quantity (which we’ll replace later).

Note that the result of repeat_by is a list column: each row holds a list containing the repeated value.

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity')
    )
)
shape: (2, 2)
┌─────────────────┬────────────────────────────────┐
│ ID              ┆ Fruit                          │
│ ---             ┆ ---                            │
│ list[i64]       ┆ list[str]                      │
╞═════════════════╪════════════════════════════════╡
│ [100, 100]      ┆ ["Apple", "Apple"]             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [200, 200, 200] ┆ ["Banana", "Banana", "Banana"] │
└─────────────────┴────────────────────────────────┘

Next, we use explode, which will take each element of each list and place it on its own row.

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity').explode()
    )
)
shape: (5, 2)
┌─────┬────────┐
│ ID  ┆ Fruit  │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 100 ┆ Apple  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 100 ┆ Apple  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 200 ┆ Banana │
└─────┴────────┘

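From there, we use the lit expression to add Quantity back to the DataFrame as a constant 1 on every row. (Note in the final output above that the literal 1 is stored as i32, not the original i64.)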

Answered By: cbilot