Polars subtract numpy 1xn array from n columns

Question:

I am struggling with Polars. I have a DataFrame and a NumPy array, and I would like to subtract the array from the DataFrame's columns.

import numpy as np
import polars as pl
import pandas as pd

df = pl.DataFrame(np.random.randn(6, 4), schema=['#', 'x', 'y', 'z'])

arr = np.array([-10, -20, -30])


df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    lambda x: np.array(x) - arr
)

shape: (6, 3)
┌───────────┬───────────┬───────────┐
│ column_0  ┆ column_1  ┆ column_2  │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 10.143819 ┆ 21.875335 ┆ 29.682364 │
│ null      ┆ null      ┆ null      │
│ null      ┆ null      ┆ null      │
│ null      ┆ null      ┆ null      │
│ null      ┆ null      ┆ null      │
│ null      ┆ null      ┆ null      │
└───────────┴───────────┴───────────┘

So the subtraction is applied only to the first row; the remaining rows are null.

But if I calculate the norm instead, for example, it works for every row:

df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    lambda x: np.sum((np.array(x) - arr)**2)**0.5
)
shape: (6, 1)
┌───────────┐
│ apply     │
│ ---       │
│ f64       │
╞═══════════╡
│ 38.242255 │
│ 37.239545 │
│ 38.07624  │
│ 36.688312 │
│ 38.419194 │
│ 36.262196 │
└───────────┘

# check if it is correct:
np.sum((df.to_pandas()[['x', 'y', 'z']].to_numpy() - arr)**2, axis=1) ** 0.5
>>> array([38.24225488, 37.23954478, 38.07623986, 36.68831161, 38.41919409,
       36.2621962 ])

In pandas one can do it like this:

df.to_pandas()[['x', 'y', 'z']] - arr

           x          y          z
0  10.143819  21.875335  29.682364
1  10.360651  21.116404  28.871060
2   9.777666  20.846593  30.325185
3   9.394726  19.357053  29.716592
4   9.223525  21.618511  30.390805
5   9.751234  21.667080  27.393393

One way that works is to handle each column separately. But that means a lot of repeated code, especially as the number of columns increases:

df.select(
    pl.col('x') - arr[0], pl.col('y') - arr[1], pl.col('z') - arr[2]
)
Asked By: 3dSpatialUser

||

Answers:

You can match the pandas output

In [15]: df.to_pandas()[['x', 'y', 'z']] - arr
Out[15]:
           x          y          z
0  10.342991  21.258934  29.083287
1  10.136803  21.543558  28.168207
2  11.900141  19.557348  29.490541
3   9.192346  19.498689  28.195094
4   9.219745  20.330358  29.005278
5  11.853378  19.458095  30.357041

with

In [17]: df.select([pl.col(col)-arr[i] for i, col in enumerate(['x', 'y', 'z'])])
Out[17]:
shape: (6, 3)
┌───────────┬───────────┬───────────┐
│ x         ┆ y         ┆ z         │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 10.342991 ┆ 21.258934 ┆ 29.083287 │
│ 10.136803 ┆ 21.543558 ┆ 28.168207 │
│ 11.900141 ┆ 19.557348 ┆ 29.490541 │
│ 9.192346  ┆ 19.498689 ┆ 28.195094 │
│ 9.219745  ┆ 20.330358 ┆ 29.005278 │
│ 11.853378 ┆ 19.458095 ┆ 30.357041 │
└───────────┴───────────┴───────────┘
Answered By: ignoring_gravity

For a short time I saw the answer I was looking for, but that comment has since been removed.

The solution was to return a tuple from the lambda:

df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    # lambda x: np.array(x) - arr  # old code
    lambda x: tuple(np.array(x) - arr)  # new code
)
Answered By: 3dSpatialUser

There are a few things going on in this question.

The first is that you really, really don’t want to use apply unless you’re doing something that needs a custom Python function.

The apply expression passes elements of the column to the Python function.
Note that you are then running Python, which will be slow.

There’s not really a Polars-native way to do what you want. When Polars sees pl.col(r'^[x|y|z]$').expr, it identifies each column that matches the regex, and a thread then does the work of whatever the rest of the expression is for each of them. The expression doesn’t know its position among the matched columns. It only knows what its data is and what it’s supposed to do with it. Therefore, there’s nothing you can put in the expr that tells it which element of the array to access.

To get what you want, you have to do something like what @ignoring_gravity did, but you can use the re module to keep the regex-based column selection.

import re
# Filter the column names in Python so each match's index into arr is known
df.select(pl.col(col)-arr[i]
          for i, col in enumerate(filter(re.compile(r'^[x|y|z]$').match, df.columns)))
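A side note on the pattern itself: inside a character class, | is a literal character rather than alternation, so ^[x|y|z]$ would also match a column literally named |. The small check below uses only the re module. The column list, including the "|" entry, is hypothetical and exists only to show the difference.

```python
import re

pattern_from_post = re.compile(r"^[x|y|z]$")  # pattern used throughout the thread
stricter_pattern = re.compile(r"^[xyz]$")     # same intent, without the literal "|"

# Hypothetical column names; "|" is included only to expose the difference.
columns = ["#", "x", "y", "z", "|"]

matched_by_post = [c for c in columns if pattern_from_post.match(c)]
matched_by_stricter = [c for c in columns if stricter_pattern.match(c)]
```

For the DataFrame in the question both patterns select the same three columns, so this only matters if a column could actually be named |.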
Answered By: Dean MacGregor

Another option that avoids the re import would be:

res = df.select(
    pl.col(col) - c
    for col, c in zip(df.select(pl.col(r'^[x|y|z]$')).columns, arr)
)

This is slightly slower for very small DataFrames (I guess because the regex column selection then dominates the run time), but equally fast for larger ones.

Answered By: Timus