Polars subtract numpy 1xn array from n columns
Question:
I am struggling with polars. I have a dataframe and a NumPy array, and I would like to subtract the array from the dataframe's columns.
import numpy as np
import polars as pl

df = pl.DataFrame(np.random.randn(6, 4), schema=['#', 'x', 'y', 'z'])
arr = np.array([-10, -20, -30])

df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    lambda x: np.array(x) - arr
)
shape: (6, 3)
┌───────────┬───────────┬───────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╡
│ 10.143819 ┆ 21.875335 ┆ 29.682364 │
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└───────────┴───────────┴───────────┘
The subtraction is applied only to the first row; every other row is null.
But if I calculate the norm, for example, it works for every row:
df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    lambda x: np.sum((np.array(x) - arr)**2)**0.5
)
shape: (6, 1)
┌───────────┐
│ apply │
│ --- │
│ f64 │
╞═══════════╡
│ 38.242255 │
│ 37.239545 │
│ 38.07624 │
│ 36.688312 │
│ 38.419194 │
│ 36.262196 │
└───────────┘
# check if it is correct:
np.sum((df.to_pandas()[['x', 'y', 'z']].to_numpy() - arr)**2, axis=1) ** 0.5
>>> array([38.24225488, 37.23954478, 38.07623986, 36.68831161, 38.41919409,
36.2621962 ])
In pandas one can do it like this:
df.to_pandas()[['x', 'y', 'z']] - arr
x y z
0 10.143819 21.875335 29.682364
1 10.360651 21.116404 28.871060
2 9.777666 20.846593 30.325185
3 9.394726 19.357053 29.716592
4 9.223525 21.618511 30.390805
5 9.751234 21.667080 27.393393
One way that works is to handle each column separately, but that means a lot of repeated code, especially as the number of columns increases:
df.select(
    pl.col('x') - arr[0], pl.col('y') - arr[1], pl.col('z') - arr[2]
)
Answers:
You can match the pandas output
In [15]: df.to_pandas()[['x', 'y', 'z']] - arr
Out[15]:
x y z
0 10.342991 21.258934 29.083287
1 10.136803 21.543558 28.168207
2 11.900141 19.557348 29.490541
3 9.192346 19.498689 28.195094
4 9.219745 20.330358 29.005278
5 11.853378 19.458095 30.357041
with
In [17]: df.select([pl.col(col)-arr[i] for i, col in enumerate(['x', 'y', 'z'])])
Out[17]:
shape: (6, 3)
┌───────────┬───────────┬───────────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╡
│ 10.342991 ┆ 21.258934 ┆ 29.083287 │
│ 10.136803 ┆ 21.543558 ┆ 28.168207 │
│ 11.900141 ┆ 19.557348 ┆ 29.490541 │
│ 9.192346 ┆ 19.498689 ┆ 28.195094 │
│ 9.219745 ┆ 20.330358 ┆ 29.005278 │
│ 11.853378 ┆ 19.458095 ┆ 30.357041 │
└───────────┴───────────┴───────────┘
I briefly saw the answer I was looking for, but the comment has since been removed.
The solution was to return a tuple:
df.select(
    pl.col(r'^[x|y|z]$')
).apply(
    # lambda x: np.array(x) - arr  # old code
    lambda x: tuple(np.array(x) - arr)  # new code
)
There are a few things going on in this question.
The first is that you really don't want to use apply unless you need to run a custom Python function. apply hands the data to that Python function, so you are now running Python rather than polars' own engine, and this will be slow.
There's not really a polars way to do what you want. When polars sees pl.col(r'^[x|y|z]$').expr, it identifies each column that matches the regex, and then a thread does the work of whatever the rest of the expression is. The expression doesn't know where in the column order it sits; it only knows what its data is and what it is supposed to do. Therefore, there's nothing you can put in expr that tells it which element of the array to access.
To get what you want, you have to do something like what @ignoring_gravity had, but you can use the re module to pick the columns.
import re
df.select(pl.col(col) - arr[i]
          for i, col in enumerate(filter(re.compile(r'^[x|y|z]$').match, df.columns)))
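A side note on the pattern itself: inside square brackets, | is a literal character rather than alternation, so in Python's re the pattern ^[x|y|z]$ also matches a column named |. A minimal sketch using only the standard library (the column names below are made up for illustration):

```python
import re

# Hypothetical column names, including one literally named "|".
columns = ["#", "x", "y", "z", "|", "xy"]

# The pattern as written in the question: "|" is part of the character class.
as_written = [c for c in columns if re.match(r"^[x|y|z]$", c)]

# A tighter character class that lists only the intended names.
tightened = [c for c in columns if re.match(r"^[xyz]$", c)]

print(as_written)
print(tightened)
```

For the columns in the question both patterns select the same three names, so this only matters if a column could actually be called |.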
Another option that avoids the re import would be:
res = df.select(
    pl.col(col) - c
    for col, c in zip(df.select(pl.col(r'^[x|y|z]$')).columns, arr)
)
This is slightly slower for very small dataframes (I guess because it is then dominated by the regex speed) but equally fast for larger ones.