How to fill a polars dataframe from a numpy array in python

Question:

I am working on a function that assigns values from a 2-dimensional numpy array to a given column of a dataframe, using the polars library in Python.

I have a dataframe df with the following columns: ['HZ', 'FL', 'Q']. The column 'HZ' takes values in [0, EC + H - 1] and the column 'FL' takes values in [1, F].

I also have a numpy array q of shape (EC + H, F), and I want to assign its values to the column 'Q' in this way:
if df['HZ'] >= EC, then df['Q'] = q[df['HZ']][df['FL'] - 1].

Below is the pandas statement that does exactly what I want:

df.loc[df['HZ'] >= EC, 'Q'] = q[df.loc[df['HZ'] >= EC, 'HZ'], df.loc[df['HZ'] >= EC, 'FL'] - 1]

Now I want to do it using polars, and I tried to do it this way:

df = df.with_columns(pl.when(pl.col('HZ') >= EC).then(q[pl.col('HZ')][pl.col('FL') - 1]).otherwise(pl.col('Q')).alias('Q'))

I get the following error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I understand that I am not giving numpy indexes in a format it accepts, but I don't know what to use instead to get the desired behavior.

Thanks in advance

Asked By: Haeden


Answers:

By test case/example I meant something like:

import numpy as np
import polars as pl

df = pl.DataFrame({
    "HZ": [0, 0, 1, 1], 
    "FL": [0, 1, 2, 3], 
    "Q": [0, 0, 0, 0]
})
q = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
EC = 1
>>> df
shape: (4, 3)
┌─────┬─────┬─────┐
│ HZ  ┆ FL  ┆ Q   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
│ 0   ┆ 1   ┆ 0   │
│ 1   ┆ 2   ┆ 0   │
│ 1   ┆ 3   ┆ 0   │
└─────┴─────┴─────┘

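The problem with your attempted approach is that q[pl.col('HZ')] is evaluated by numpy before .with_columns executes, and numpy does not understand pl.col('HZ'), which is an expression rather than an array of values.

One way to index the numpy array with the actual column values is .map (pl.map here, which was renamed pl.map_batches in later polars releases):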

df.with_columns(Q = 
   pl.when(pl.col("HZ") >= EC)
     .then(
        pl.map(
           ["HZ", pl.col("FL") - 1], 
           lambda cols: q[cols[0], cols[1]])
        .flatten())
     .otherwise(pl.col("Q")))
shape: (4, 3)
┌─────┬─────┬─────┐
│ HZ  ┆ FL  ┆ Q   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
│ 0   ┆ 1   ┆ 0   │
│ 1   ┆ 2   ┆ 6   │
│ 1   ┆ 3   ┆ 7   │
└─────┴─────┴─────┘

The .map approach is slightly awkward; it would probably be easier to keep the lookup data in a format better suited to polars, e.g. another dataframe.

df_q = pl.DataFrame(
   ((row, col, value) for (row, col), value in np.ndenumerate(q)),
   schema=["HZ", "FL", "Q"]
)
>>> df_q
shape: (8, 3)
┌─────┬─────┬─────┐
│ HZ  ┆ FL  ┆ Q   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 1   │
│ 0   ┆ 1   ┆ 2   │
│ 0   ┆ 2   ┆ 3   │
│ 0   ┆ 3   ┆ 4   │
│ 1   ┆ 0   ┆ 5   │
│ 1   ┆ 1   ┆ 6   │
│ 1   ┆ 2   ┆ 7   │
│ 1   ┆ 3   ┆ 8   │
└─────┴─────┴─────┘

This would allow you to use a more conventional approach to matching values, such as a .join:

df.join(df_q.with_columns(pl.col("FL") + 1), on=["HZ", "FL"], how="left")
shape: (4, 4)
┌─────┬─────┬─────┬─────────┐
│ HZ  ┆ FL  ┆ Q   ┆ Q_right │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════╪═════════╡
│ 0   ┆ 0   ┆ 0   ┆ null    │
│ 0   ┆ 1   ┆ 0   ┆ 1       │
│ 1   ┆ 2   ┆ 0   ┆ 6       │
│ 1   ┆ 3   ┆ 0   ┆ 7       │
└─────┴─────┴─────┴─────────┘
Answered By: jqurious