Convert column to Numpy within a select statement in Polars

Question:

I am trying to transform a date column to the next business day after each date (if the date isn’t already a business day in which case it remains unchanged). To do this I am using a Numpy function called busday_offset which takes a Numpy array as its first parameter. The line of code bellow doesn’t work because pl.col(‘creation_date’) isn’t a Polars column but some kind of abstract object. Thus, ".to_numpy" can’t be called on it to convert the column to a numpy array.

sales_orders.with_columns(
    [
        pl.lit(np.busday_offset(pl.col('creation_date').to_numpy(), 0, roll='forward'))
    ]
)

-> AttributeError: ‘Expr’ object has no attribute ‘to_numpy’

I am aware that a workaround would be to use the ".get_column" method on the "sales_orders" table as such:

sales_orders.with_columns(
    [
        pl.lit(np.busday_offset(sales_orders.get_column('creation_date').to_numpy(), 0, roll='forward'))
    ]
)

But this lacks elegance and makes it impossible to have nested "with_columns" statements as such:

sales_orders.with_columns(
    [
        pl.lit(np.busday_offset(sales_orders.get_column('creation_date'), 0, roll='forward')).alias('order_acknowledgement_date')
    ]
).with_columns(
    [
        pl.lit(np.busday_offset(sales_orders.get_column('order_acknowledgement_date'), 2, roll='forward')).alias('ship_due_date')
    ]
)

-> ColumnNotFoundError: order_acknowledgement_date
This indeed doesn’t work because "order_acknowledgement_date" isn’t part of the "sales_orders" table.

So my question is: how can I transform pl.col(‘creation_date’) into a Numpy array within a select/with_columns statement without refering to the table sales_orders directly?

Asked By: Viktor Crettenand

||

Answers:

You can use .map()

df = pl.DataFrame({"date": ["2011-03-20", "2011-03-22", "2011-03-24"]})
df = df.with_columns(pl.col("date").str.to_date())

df.with_columns(busday = 
   pl.col("date").map(lambda date: 
      pl.Series(
         np.busday_offset(date, 0, roll="forward")
           .astype("datetime64[ms]")
      )
      .cast(pl.Date)
   )
)
shape: (3, 2)
┌────────────┬────────────┐
│ date       ┆ busday     │
│ ---        ┆ ---        │
│ date       ┆ date       │
╞════════════╪════════════╡
│ 2011-03-20 ┆ 2011-03-21 │
│ 2011-03-22 ┆ 2011-03-22 │
│ 2011-03-24 ┆ 2011-03-24 │
└────────────┴────────────┘

Polars doesn’t currently understand numpy.datetime objects, hence the .astype() and .cast().

df.with_columns(busday = 
   pl.col("date").map(lambda date: 
      pl.Series(np.busday_offset(date, 0, roll='forward'))
   )
)

# shape: (3, 2)
# ┌────────────┬────────────┐
# │ date       ┆ busday     │
# │ ---        ┆ ---        │
# │ date       ┆ object     │ # <- dtype object
# ╞════════════╪════════════╡
# │ 2011-03-20 ┆ 2011-03-21 │
# │ 2011-03-22 ┆ 2011-03-22 │
# │ 2011-03-24 ┆ 2011-03-24 │
# └────────────┴────────────┘

There is a pending pull request to address this: https://github.com/pola-rs/polars/issues/9779#issuecomment-1627084008

Answered By: jqurious