Generate a running tally with Polars where the closing balance one month becomes the opening balance for the next month

Question:

I want to generate a simulation of a running tally over a period of time, given an initial opening balance. For example, an imaginary business starts with $1000 and experiences a random profit and loss each month. How much might this business have after 12 months?
My issue is that I can generate the needed columns, but I cannot figure out how to carry the simulated tally forward to the next month without clobbering data in the current month.

I can create a Polars dataframe, populate it with the needed profit and loss columns, and generate the closing balance.
Using stacked with_columns() calls, I created the columns for the opening and closing balances, generated some random profits and losses, and then updated the closing balance. But I cannot get the balance to carry forward.

import polars as pl
import datetime as dt
from dateutil.relativedelta import relativedelta
from random import normalvariate  # used to simulate the profit and loss below
start_date = dt.date.today() + relativedelta(months=1, day=1)
df = pl.DataFrame(
    pl.date_range(start_date, start_date + relativedelta(months=5), '1mo').alias('date'),
)
beginning_balance = 1000.0
df = df.with_columns(
        pl.lit(beginning_balance).alias('beginning_balance'),
        pl.lit(beginning_balance).alias('closing_balance'),
).with_columns(
    [
        # Carry last month's closing balance forward (except for the first month)
        pl.when(pl.col('date') == start_date)
        .then(pl.col('beginning_balance'))
        .otherwise(pl.col('closing_balance').shift(1))
        .alias('beginning_balance'),
        # Simulate random monthly profit and loss
        pl.Series([normalvariate(100, 80) for _ in range(len(df))]).round(2).alias('profit'),
        pl.Series([normalvariate(100, 75) for _ in range(len(df))]).round(2).alias('loss'),
    ]
).with_columns(
    # Update the closing balance from this month's activity
    (pl.col('beginning_balance') + pl.col('profit') - pl.col('loss')).alias('closing_balance'),
)
df

resulting in:

shape: (6, 5)
┌────────────┬───────────────────┬─────────────────┬────────┬────────┐
│ date       | beginning_balance | closing_balance | profit | loss   │
│ ---        | ---               | ---             | ---    | ---    │
│ date       | f64               | f64             | f64    | f64    │
╞════════════╪═══════════════════╪═════════════════╪════════╪════════╡
│ 2023-04-01 | 1000.0            | 909.61          | 72.97  | 163.36 │
│ 2023-05-01 | 1000.0            | 1072.29         | 212.84 | 140.55 │
│ 2023-06-01 | 1000.0            | 974.54          | 82.75  | 108.21 │
│ 2023-07-01 | 1000.0            | 824.08          | -60.69 | 115.23 │
│ 2023-08-01 | 1000.0            | 940.71          | 155.6  | 214.89 │
│ 2023-09-01 | 1000.0            | 994.5           | 94.23  | 99.73  │
└────────────┴───────────────────┴─────────────────┴────────┴────────┘

Note that the closing balance from the first month (909.61) was not carried forward to the next month, which stayed at 1000.0, so the running tally never develops: the shift(1) only ever sees the initial closing_balance column of 1000.0 values, not the closing balances computed in the final with_columns.

I can definitely use a loop (for or while) to walk through each row, but this thwarts the optimization and parallelization of Polars. The following would work, but is much slower (especially when I am running many simulations with much more detailed calculations).

current_tally = beginning_balance
for t in range(len(df)):
    beginning_balance = current_tally
    current_tally = beginning_balance + df[t, 'profit'] - df[t, 'loss']
    df[t, 'beginning_balance'] = beginning_balance
    df[t, 'closing_balance'] = current_tally

df
shape: (6, 5)
┌────────────┬───────────────────┬─────────────────┬────────┬────────┐
│ date       | beginning_balance | closing_balance | profit | loss   │
│ ---        | ---               | ---             | ---    | ---    │
│ date       | f64               | f64             | f64    | f64    │
╞════════════╪═══════════════════╪═════════════════╪════════╪════════╡
│ 2023-04-01 | 1000.0            | 1063.13         | 115.49 | 52.36  │
│ 2023-05-01 | 1063.13           | 1224.37         | 184.62 | 23.38  │
│ 2023-06-01 | 1224.37           | 1258.02         | 114.68 | 81.03  │
│ 2023-07-01 | 1258.02           | 1323.73         | 239.12 | 173.41 │
│ 2023-08-01 | 1323.73           | 1341.33         | 102.18 | 84.58  │
│ 2023-09-01 | 1341.33           | 1329.88         | 19.9   | 31.35  │
└────────────┴───────────────────┴─────────────────┴────────┴────────┘
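
The loop makes the recurrence explicit: closing[i] = closing[i-1] + profit[i] - loss[i]. That recurrence telescopes into the opening balance plus a cumulative sum of (profit - loss), which is what the answers below exploit. A minimal plain-Python sketch (using the first three months from the loop output above) to verify the equivalence:

from itertools import accumulate

opening = 1000.0
profit = [115.49, 184.62, 114.68]
loss = [52.36, 23.38, 81.03]

# Row-by-row recurrence, as in the loop above
looped = []
balance = opening
for p, l in zip(profit, loss):
    balance = balance + p - l
    looped.append(balance)

# Closed form: opening balance plus the cumulative net of profit - loss
cumulative = [opening + net for net in accumulate(p - l for p, l in zip(profit, loss))]

assert [round(x, 2) for x in looped] == [round(x, 2) for x in cumulative] == [1063.13, 1224.37, 1258.02]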

How would I do this within Polars?

Asked By: Buckley


Answers:

I think this is what .cumulative_eval is supposed to solve?

(df
 .with_columns(tally = 
    pl.struct(["profit", "loss"])
      .cumulative_eval(
         pl.element().struct.field("profit").sum() 
          - pl.element().struct.field("loss").sum(),
         parallel=True)
      .shift()
      .fill_null(0))
)
shape: (6, 6)
┌────────────┬───────────────────┬─────────────────┬────────┬────────┬────────┐
│ date       | beginning_balance | closing_balance | profit | loss   | tally  │
│ ---        | ---               | ---             | ---    | ---    | ---    │
│ date       | f64               | f64             | f64    | f64    | f64    │
╞════════════╪═══════════════════╪═════════════════╪════════╪════════╪════════╡
│ 2023-04-01 | 1000.0            | 1141.67         | 188.51 | 46.84  | 0.0    │
│ 2023-05-01 | 1000.0            | 996.48          | 119.3  | 122.82 | 141.67 │
│ 2023-06-01 | 1000.0            | 1004.74         | 184.48 | 179.74 | 138.15 │
│ 2023-07-01 | 1000.0            | 924.85          | 27.21  | 102.36 | 142.89 │
│ 2023-08-01 | 1000.0            | 999.89          | 151.21 | 151.32 | 67.74  │
│ 2023-09-01 | 1000.0            | 964.74          | 102.89 | 138.15 | 67.63  │
└────────────┴───────────────────┴─────────────────┴────────┴────────┴────────┘
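
For reference, cumulative_eval evaluates the inner expression over each expanding prefix of the series, so the tally at a given row is the net profit of all rows up to it (shifted here so the current row is excluded). A tiny sketch with made-up values:

import polars as pl

s = pl.Series("x", [1, 2, 3, 4])
# The inner expression is re-evaluated on every expanding prefix of s
print(s.cumulative_eval(pl.element().sum()))
# [1, 3, 6, 10] - a running sum, recomputed from scratch for each prefix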

You can then add it to both balance columns:

(df
 .with_columns(tally = 
    pl.struct(["profit", "loss"])
      .cumulative_eval(
         pl.element().struct.field("profit").sum() 
          - pl.element().struct.field("loss").sum(),
         parallel=True)
      .shift()
      .fill_null(0))
 .with_columns(
    pl.col("beginning_balance", "closing_balance") + pl.col("tally"))
)
shape: (6, 6)
┌────────────┬───────────────────┬─────────────────┬────────┬────────┬────────┐
│ date       | beginning_balance | closing_balance | profit | loss   | tally  │
│ ---        | ---               | ---             | ---    | ---    | ---    │
│ date       | f64               | f64             | f64    | f64    | f64    │
╞════════════╪═══════════════════╪═════════════════╪════════╪════════╪════════╡
│ 2023-04-01 | 1000.0            | 1141.67         | 188.51 | 46.84  | 0.0    │
│ 2023-05-01 | 1141.67           | 1138.15         | 119.3  | 122.82 | 141.67 │
│ 2023-06-01 | 1138.15           | 1142.89         | 184.48 | 179.74 | 138.15 │
│ 2023-07-01 | 1142.89           | 1067.74         | 27.21  | 102.36 | 142.89 │
│ 2023-08-01 | 1067.74           | 1067.63         | 151.21 | 151.32 | 67.74  │
│ 2023-09-01 | 1067.63           | 1032.37         | 102.89 | 138.15 | 67.63  │
└────────────┴───────────────────┴─────────────────┴────────┴────────┴────────┘
Answered By: jqurious

Alternative solution, which I think is more efficient, since .cumulative_eval re-evaluates the sums over every expanding prefix and therefore repeats a lot of additions:

Starting dataframe with date and the random profit and loss:

import polars as pl
import datetime as dt
from dateutil.relativedelta import relativedelta
from random import normalvariate  # used to simulate the profit and loss

num_months = 6
start_date = dt.date.today() + relativedelta(months=1, day=1)
df = pl.DataFrame().with_columns([
    pl.date_range(start_date, start_date + relativedelta(months=num_months - 1), '1mo').alias('date'),
    pl.Series([normalvariate(100, 80) for _ in range(num_months)]).round(2).alias('profit'),
    pl.Series([normalvariate(100, 75) for _ in range(num_months)]).round(2).alias('loss'),
])

Building the new columns based on columns profit and loss:

beginning_balance = 1000.0
df = df.with_columns(
    (beginning_balance + (pl.col('profit') - pl.col('loss')).cumsum())
    .alias('closing_balance')
).with_columns(
    pl.col('closing_balance').shift_and_fill(1, beginning_balance)
    .alias('beginning_balance')
)

The tally is calculated by using .cumsum() on the difference between profit and loss.
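
Note that in more recent Polars releases .cumsum() was renamed to .cum_sum() and .shift_and_fill() was deprecated in favor of .shift() with a fill_value argument; the same logic under those newer names would look roughly like this:

beginning_balance = 1000.0
df = df.with_columns(
    (beginning_balance + (pl.col('profit') - pl.col('loss')).cum_sum())
    .alias('closing_balance')
).with_columns(
    pl.col('closing_balance').shift(1, fill_value=beginning_balance)
    .alias('beginning_balance')
)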

Answered By: Timus