Split value between polars DataFrame rows
Question:
I would like to find a way to distribute the values of a DataFrame among the rows of another DataFrame using polars (without iterating through the rows).
I have a dataframe with the amounts to be distributed:
Name | Amount |
---|---|
A | 100 |
B | 300 |
C | 250 |
And a target DataFrame to which I want to append the distributed values (in a new column) using the common "Name" column.
Name | Item | Price |
---|---|---|
A | x1 | 40 |
A | x2 | 60 |
B | y1 | 50 |
B | y2 | 150 |
B | y3 | 200 |
C | z1 | 400 |
The rows in the target are sorted and the assigned amount should match the price in each row (as long as there is enough amount remaining).
So the result in this case should look like this:
Name | Item | Price | Assigned amount |
---|---|---|---|
A | x1 | 40 | 40 |
A | x2 | 60 | 60 |
B | y1 | 50 | 50 |
B | y2 | 150 | 150 |
B | y3 | 200 | 100 |
C | z1 | 400 | 250 |
In this example, the amount for A can be fully distributed, so the assigned amounts equal the prices. For the last item of B and for C, however, we write the remaining amount, since the prices exceed what is left.
Is there an efficient way to do this?
My initial solution was to calculate the cumulative sum of the Price in a new column in the target dataframe, then left join the source DataFrame and subtract the values of the cumulative sum. This would work if the amount is high enough, but for the last item of B and C I would get negative values and not the remaining amount.
Edit
Example dataframes:
import polars as pl
df1 = pl.DataFrame({"Name": ["A", "B", "C"], "Amount": [100, 300, 250]})
df2 = pl.DataFrame({"Name": ["A", "A", "B", "B", "B", "C"], "Item": ["x1", "x2", "y1", "y2", "y3", "z"], "Price": [40, 60, 50, 150, 200, 400]})
Answers:
This assumes the order of the df is the order of priority; if not, sort it first.

You first want to join your two dfs, then make a helper column that is the cumsum of `Price` less `Price`. I call that `spent`. It's more like a potential spent, because there's no guarantee it doesn't go over `Amount`.

Add another two helper columns: one for the difference between `Amount` and `spent`, which we'll call `have1`, as that's the amount we have. In the sample data this didn't come up, but we need to make sure this isn't less than 0, so we add another column which is just literally zero; we'll call it `z`.

Add another helper column which will be the greater value between 0 and `have1`; we'll call it `have2`.

Lastly, we'll determine the `Assigned amount` as the smaller value between `have2` and `Price`.
(
    df1.join(df2, on='Name')
    .with_columns((pl.col("Price").cumsum() - pl.col("Price")).over("Name").alias("spent"))
    .with_columns([(pl.col("Amount") - pl.col("spent")).alias("have1"), pl.lit(0).alias('z')])
    .with_columns(pl.concat_list([pl.col('z'), pl.col('have1')]).arr.max().alias('have2'))
    .with_columns(pl.concat_list([pl.col('have2'), pl.col("Price")]).arr.min().alias("Assigned amount"))
    .select(["Name", "Item", "Price", "Assigned amount"])
)
You can reduce this to a single nested expression like this…
(
    df1.join(df2, on='Name')
    .select(["Name", "Item", "Price",
        pl.concat_list([
            pl.concat_list([
                pl.repeat(0, pl.count()),
                pl.col("Amount") - (pl.col("Price").cumsum() - pl.col("Price")).over("Name")
            ]).arr.max(),
            pl.col("Price")
        ]).arr.min().alias("Assigned amount")
    ])
)
shape: (6, 4)
┌──────┬──────┬───────┬─────────────────┐
│ Name ┆ Item ┆ Price ┆ Assigned amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪═════════════════╡
│ A ┆ x1 ┆ 40 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 │
└──────┴──────┴───────┴─────────────────┘
You can take the minimum value of the `Price` or the difference. `.clip_min(0)` can be used to replace the negatives.
[Edit: See @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ’s answer for a neater way to write this.]
(
df2
.join(df1, on="Name")
.with_columns(
cumsum = pl.col("Price").cumsum().over("Name"))
.with_columns(
assigned = pl.col("Amount") - (pl.col("cumsum") - pl.col("Price")))
.with_columns(
assigned = pl.min(["Price", "assigned"]).clip_min(0))
)
shape: (6, 6)
┌──────┬──────┬───────┬────────┬────────┬──────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ cumsum ┆ assigned │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---    ┆ ---      │
│ str  ┆ str  ┆ i64   ┆ i64    ┆ i64    ┆ i64      │
╞══════╪══════╪═══════╪════════╪════════╪══════════╡
│ A    ┆ x1   ┆ 40    ┆ 100    ┆ 40     ┆ 40       │
│ A    ┆ x2   ┆ 60    ┆ 100    ┆ 100    ┆ 60       │
│ B    ┆ y1   ┆ 50    ┆ 300    ┆ 50     ┆ 50       │
│ B    ┆ y2   ┆ 150   ┆ 300    ┆ 200    ┆ 150      │
│ B    ┆ y3   ┆ 200   ┆ 300    ┆ 400    ┆ 100      │
│ C    ┆ z    ┆ 400   ┆ 250    ┆ 400    ┆ 250      │
└──────┴──────┴───────┴────────┴────────┴──────────┘
@jqurious, good answer. This might be slightly more succinct:
(
df2.join(df1, on="Name")
.with_columns(
pl.min([
pl.col('Price'),
pl.col('Amount') -
pl.col('Price').cumsum().shift_and_fill(1, 0).over('Name')
])
.clip_min(0)
.alias('assigned')
)
)
shape: (6, 5)
┌──────┬──────┬───────┬────────┬──────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ assigned │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪════════╪══════════╡
│ A ┆ x1 ┆ 40 ┆ 100 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 100 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 300 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 300 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 300 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 ┆ 250 │
└──────┴──────┴───────┴────────┴──────────┘