polars equivalent to groupby.last

Question:

Say i have a polars dataframe:

    import polars as pl 
    df = pl.DataFrame({'index': [1,2,3,2,1],
                       'object': [1, 1, 1, 2, 2], 
                      'period': [1, 2, 4, 4, 23],
                      'value': [24, 67, 89, 5, 23],
                      })

How would i get dict of index -> to last value
df.col('value').last().over(['index']).alias("last") is the last value but that requires a lot of extra computation and more work to get to the key value pairs.

Asked By: Michael WS

||

Answers:

The over function will keep all rows, which is probably not what you want. An easy way to get just the last value for index is to use unique.

(
    df
    .select(['index', 'value'])
    .unique(subset='index', keep="last")
)
shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 23    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 89    │
└───────┴───────┘

From this point, you can use the to_dicts method to convert the DataFrame to a list of dictionaries.

last_values = (
    df
    .select(["index", "value"])
    .unique(subset="index", keep="last")
    .to_dicts()
)
last_values
[{'index': 1, 'value': 23}, {'index': 2, 'value': 5}, {'index': 3, 'value': 89}]

If you are looking to later import this into a DataFrame, you’ll want to stop at this point. For example:

pl.DataFrame(last_values)
>>> pl.DataFrame(last_values)
shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 23    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 89    │
└───────┴───────┘

However, if you want to collapse this into a single dictionary of index:value pairs, you can use a dictionary comprehension.


{
    next_dict["index"]: next_dict["value"]
    for next_dict in last_values
}
{1: 23, 2: 5, 3: 89}

Edit: Updating based on date

Let’s assume that we have this data:

import polars as pl
import datetime

df = pl.DataFrame({
    "index": [1, 2, 3],
    "value": [10, 20, 30],
}).join(
    pl.DataFrame({
        'date': pl.date_range(datetime.date(2021, 1, 1), datetime.date(2023, 1, 1), "1y")
    }),
    how="cross"
)
df
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2023-01-01 │
└───────┴───────┴────────────┘

And we have these values that we want to update.

update_df = pl.DataFrame({
    "index": [2, 3],
    "value": [200, 300],
})
update_df
shape: (2, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 2     ┆ 200   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 300   │
└───────┴───────┘

Note: I’ve purposely left out index "1" (to show what will happen).

If we want to update the value associated with each index, but only beyond a certain date, we can use a join_asof.

Since this is an advanced method, we’ll take it in steps.

We’ll add the current_date to the update_df as a literal. (The same value for all rows.)

We also need to make sure both our DataFrames are sorted by the "as_of" column (date, not index). (update_df will already be sorted because its the same date on each row.)

I’ll also sort after the join_asof so that we can see what is happening more clearly. (You don’t need to do this step.)

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort(['date'])
    .rename({'value': 'prev_value'})
    .join_asof(
        update_df.with_column(pl.lit(current_date).alias('date')),
        on='date',
        by=['index'],
        strategy='backward'
    )
    .sort(['index', 'date'])
)
shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date       ┆ value │
│ ---   ┆ ---        ┆ ---        ┆ ---   │
│ i64   ┆ i64        ┆ date       ┆ i64   │
╞═══════╪════════════╪════════════╪═══════╡
│ 1     ┆ 10         ┆ 2021-01-01 ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 10         ┆ 2022-01-01 ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 10         ┆ 2023-01-01 ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2021-01-01 ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2021-01-01 ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
└───────┴────────────┴────────────┴───────┘

Notice that only those rows with a date >= 2022-01-01 have a non-null value for value. (I’ll show how to do a > 2022-01-01 at the end.)

Next we’ll use fill_null to fill the null values in value with the prev_value column.

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort(['date'])
    .rename({'value': 'prev_value'})
    .join_asof(
        update_df.with_column(pl.lit(current_date).alias('date')),
        on='date',
        by=['index'],
        strategy='backward'
    )
    .sort(['index', 'date'])
    .with_column(pl.col('value').fill_null(pl.col('prev_value')))
)
shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date       ┆ value │
│ ---   ┆ ---        ┆ ---        ┆ ---   │
│ i64   ┆ i64        ┆ date       ┆ i64   │
╞═══════╪════════════╪════════════╪═══════╡
│ 1     ┆ 10         ┆ 2021-01-01 ┆ 10    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 10         ┆ 2022-01-01 ┆ 10    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 10         ┆ 2023-01-01 ┆ 10    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2021-01-01 ┆ 20    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2021-01-01 ┆ 30    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
└───────┴────────────┴────────────┴───────┘

Now, to clean up, we can drop the prev_value column, and re-arrange the columns.

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort(['date'])
    .rename({'value': 'prev_value'})
    .join_asof(
        update_df.with_column(pl.lit(current_date).alias('date')),
        on='date',
        by=['index'],
        strategy='backward'
    )
    .sort(['index', 'date'])
    .with_column(pl.col('value').fill_null(pl.col('prev_value')))
    .drop(['prev_value'])
    .select([
        pl.exclude('date'),
        pl.col('date')
    ])
)
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 200   ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 200   ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 300   ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 300   ┆ 2023-01-01 │
└───────┴───────┴────────────┘

If you need to update only those rows that are strictly greater than current_date, you can simply add one day to your current_date. Polars makes this easy with the offset_by expression.

(
    df
    .sort(['date'])
    .rename({'value': 'prev_value'})
    .join_asof(
        update_df.with_column(pl.lit(current_date).dt.offset_by('1d').alias('date')),
        on='date',
        by=['index'],
        strategy='backward'
    )
    .sort(['index', 'date'])
    .with_column(pl.col('value').fill_null(pl.col('prev_value')))
    .drop(['prev_value'])
    .select([
        pl.exclude('date'),
        pl.col('date')
    ])
)
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 10    ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 20    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 200   ┆ 2023-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2021-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 30    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 300   ┆ 2023-01-01 │
└───────┴───────┴────────────┘
Answered By: cbilot
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.