Split value between polars DataFrame rows

Question:

I would like to find a way to distribute the values of a DataFrame among the rows of another DataFrame using polars (without iterating through the rows).

I have a dataframe with the amounts to be distributed:

Name Amount
A 100
B 300
C 250

And a target DataFrame to which I want to append the distributed values (in a new column) using the common "Name" column.

Name Item Price
A x1 40
A x2 60
B y1 50
B y2 150
B y3 200
C z1 400

The rows in the target are sorted and the assigned amount should match the price in each row (as long as there is enough amount remaining).

So the result in this case should look like this:

Name Item Price Assigned amount
A x1 40 40
A x2 60 60
B y1 50 50
B y2 150 150
B y3 200 100
C z1 400 250

In this example, we can distribute the amounts for A, so that they are the same as the price. However, for the last item of B and for C we write the remaining amounts as the prices are too high.

Is there an efficient way to do this?

My initial solution was to calculate the cumulative sum of the Price in a new column in the target dataframe, then left join the source DataFrame and subtract the values of the cumulative sum. This would work if the amount is high enough, but for the last item of B and C I would get negative values and not the remaining amount.

Edit

Example dataframes:

import polars as pl

df1 = pl.DataFrame({"Name": ["A", "B", "C"], "Amount": [100, 300, 250]})
df2 = pl.DataFrame({
    "Name": ["A", "A", "B", "B", "B", "C"],
    "Item": ["x1", "x2", "y1", "y2", "y3", "z"],
    "Price": [40, 60, 50, 150, 200, 400],
})
Asked By: szuperpingvin


Answers:

This assumes the row order of the DataFrame is the order of priority; if it is not, sort it first.

First join the two DataFrames, then add a helper column holding the cumulative sum of Price within each Name minus the row’s own Price. I call that spent. It’s really a potential spent, because nothing guarantees it stays below Amount.

Add two more helper columns: have1, the difference between Amount and spent (the amount we still have), and z, a column that is literally zero. The sample data never hits this case, but have1 must not be allowed to drop below 0.

Add another helper column, have2, which is the greater of 0 and have1.

Lastly, the Assigned amount is the smaller of have2 and Price.
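The arithmetic behind these helper columns can be sketched in plain Python for a single Name. This is only an illustration of the formula, not the Polars solution (it loops over rows, which the question wants to avoid), and the function name assign_amounts is hypothetical:

```python
def assign_amounts(amount, prices):
    """Distribute `amount` over `prices` in row order for one Name."""
    assigned = []
    spent = 0  # total Price of the earlier rows (cumsum minus own Price)
    for price in prices:
        have1 = amount - spent       # what is left; may be negative
        have2 = max(0, have1)        # never hand out a negative amount
        assigned.append(min(have2, price))
        spent += price
    return assigned


print(assign_amounts(300, [50, 150, 200]))  # [50, 150, 100]
print(assign_amounts(100, [80, 50, 30]))    # [80, 20, 0]
```

The second call shows the case the sample data does not cover: once the amount is used up, have1 goes negative and the clamp to 0 keeps the assigned amount at 0.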

# Written against an older Polars API (cumsum, .arr); newer releases rename these.
(
    df1.join(df2, on='Name')
    # spent: total Price of the earlier rows within the same Name
    .with_columns((pl.col("Price").cumsum()-pl.col("Price")).over("Name").alias("spent"))
    .with_columns([(pl.col("Amount")-pl.col("spent")).alias("have1"), pl.lit(0).alias('z')])
    # have2 = max(0, have1), then Assigned amount = min(have2, Price)
    .with_columns(pl.concat_list([pl.col('z'), pl.col('have1')]).arr.max().alias('have2'))
    .with_columns(pl.concat_list([pl.col('have2'), pl.col("Price")]).arr.min().alias("Assigned amount"))
    .select(["Name", "Item","Price","Assigned amount"])
)

You can reduce this to a single nested expression like this…

(
    df1.join(df2, on='Name')
    .select(["Name", "Item","Price",
        pl.concat_list([
            pl.concat_list([
                pl.repeat(0, pl.count()), 
                pl.col("Amount")-(pl.col("Price").cumsum()-pl.col("Price")).over("Name")
            ]).arr.max(), 
            pl.col("Price")
        ]).arr.min().alias("Assigned amount")
    ])
)


shape: (6, 4)
┌──────┬──────┬───────┬─────────────────┐
│ Name ┆ Item ┆ Price ┆ Assigned amount │
│ ---  ┆ ---  ┆ ---   ┆ ---             │
│ str  ┆ str  ┆ i64   ┆ i64             │
╞══════╪══════╪═══════╪═════════════════╡
│ A    ┆ x1   ┆ 40    ┆ 40              │
│ A    ┆ x2   ┆ 60    ┆ 60              │
│ B    ┆ y1   ┆ 50    ┆ 50              │
│ B    ┆ y2   ┆ 150   ┆ 150             │
│ B    ┆ y3   ┆ 200   ┆ 100             │
│ C    ┆ z    ┆ 400   ┆ 250             │
└──────┴──────┴───────┴─────────────────┘
Answered By: Dean MacGregor

You can take the minimum of the Price and the difference between the Amount and what the earlier rows of the same Name have already used.

.clip_min(0) can be used to replace the negatives.

[Edit: See @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ’s answer for a neater way to write this.]

(
   df2
   .join(df1, on="Name")
   .with_columns(
      cumsum = pl.col("Price").cumsum().over("Name"))
   .with_columns(
      assigned = pl.col("Amount") - (pl.col("cumsum") - pl.col("Price")))
   .with_columns(
      assigned = pl.min(["Price", "assigned"]).clip_min(0))
)
shape: (6, 6)
┌──────┬──────┬───────┬────────┬────────┬──────────┐
│ Name | Item | Price | Amount | cumsum | assigned │
│ ---  | ---  | ---   | ---    | ---    | ---      │
│ str  | str  | i64   | i64    | i64    | i64      │
╞══════╪══════╪═══════╪════════╪════════╪══════════╡
│ A    | x1   | 40    | 100    | 40     | 40       │
│ A    | x2   | 60    | 100    | 100    | 60       │
│ B    | y1   | 50    | 300    | 50     | 50       │
│ B    | y2   | 150   | 300    | 200    | 150      │
│ B    | y3   | 200   | 300    | 400    | 100      │
│ C    | z    | 400   | 250    | 400    | 250      │
└──────┴──────┴───────┴────────┴────────┴──────────┘
Answered By: jqurious

@jqurious, good answer. This might be slightly more succinct:

(
    df2.join(df1, on="Name")
    .with_columns(
        pl.min([
            pl.col('Price'),
            pl.col('Amount') -
            pl.col('Price').cumsum().shift_and_fill(1, 0).over('Name')
        ])
        .clip_min(0)
        .alias('assigned')
    )
)
shape: (6, 5)
┌──────┬──────┬───────┬────────┬──────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ assigned │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---      │
│ str  ┆ str  ┆ i64   ┆ i64    ┆ i64      │
╞══════╪══════╪═══════╪════════╪══════════╡
│ A    ┆ x1   ┆ 40    ┆ 100    ┆ 40       │
│ A    ┆ x2   ┆ 60    ┆ 100    ┆ 60       │
│ B    ┆ y1   ┆ 50    ┆ 300    ┆ 50       │
│ B    ┆ y2   ┆ 150   ┆ 300    ┆ 150      │
│ B    ┆ y3   ┆ 200   ┆ 300    ┆ 100      │
│ C    ┆ z    ┆ 400   ┆ 250    ┆ 250      │
└──────┴──────┴───────┴────────┴──────────┘