Numpy array to list of lists in polars dataframe

Question:

I’m trying to save a dataframe with a 2D list in each cell to a parquet file. As an example, I created a polars dataframe with a 2D list. As can be seen in the table, the dtype of both columns is list[list[i64]].

┌─────────────────────┬─────────────────────┐
│ a                   ┆ b                   │
│ ---                 ┆ ---                 │
│ list[list[i64]]     ┆ list[list[i64]]     │
╞═════════════════════╪═════════════════════╡
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
└─────────────────────┴─────────────────────┘

In the code below I saved and read back the dataframe to confirm that it can indeed be written to and read from a parquet file.

After this step I created a numpy array from the dataframe. This is where the problem starts. Converting back to a polars dataframe is still possible, but the dtype of both columns is now object.

┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a                                   ┆ b                                   │
│ ---                                 ┆ ---                                 │
│ object                              ┆ object                              │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
└─────────────────────────────────────┴─────────────────────────────────────┘

Now, when I try to write this dataframe to a parquet file, the following error pops up: PanicException: cannot convert object to arrow. Which makes sense, because the dtypes are now object.

I tried using pl.from_numpy(), but it complains about 2D arrays. I also tried casting, but casting from object does not seem possible. Creating the dataframe with the previous dtype also does not work.

Question:
How can I still write this dataframe to a parquet file? Preferably with dtype list[list[i64]]. I need to keep the 2D array structure.

When I create the desired result as a plain list I am able to write and read it, but not when it is a numpy array.

Proof code:

import polars as pl
import numpy as np

data = {
    "a": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]], 
    "b": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]]
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')

read_df = pl.read_parquet('test.parquet')
print(read_df)

Proof result:

┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a                                   ┆ b                                   │
│ ---                                 ┆ ---                                 │
│ list[list[list[i64]]]               ┆ list[list[list[i64]]]               │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
└─────────────────────────────────────┴─────────────────────────────────────┘

Sample code:

import polars as pl
import numpy as np

data = {
    "a": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]], 
    "b": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')

read_df = pl.read_parquet('test.parquet')
print(read_df)

arr = np.dstack([read_df, df])

# schema={'a': list[list[pl.Int32]], 'b': list[list[pl.Int32]]}
combined = pl.DataFrame(arr.tolist(), schema=df.columns)
print(combined)

# combined.with_column(pl.col('a').cast(pl.List, strict=False).alias('a_list'))

combined.write_parquet('test_result.parquet')
Asked By: Sam


Answers:

Perhaps there is a simpler approach – but you could do the "stacking" with explode/groupby:

frames = df, read_df

frames = (
   frame.with_columns(col=n)
        .with_row_count("row")
        .explode(pl.exclude("row", "col"))
   for n, frame in enumerate(frames)
)

combined = (
   pl.concat(frames)
     .groupby("row", "col", maintain_order=True)
     .agg(pl.all())
     .groupby("row", maintain_order=True)
     .agg(pl.exclude("col"))
     .drop("row")
)
shape: (2, 2)
┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a                                   ┆ b                                   │
│ ---                                 ┆ ---                                 │
│ list[list[list[i64]]]               ┆ list[list[list[i64]]]               │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
└─────────────────────────────────────┴─────────────────────────────────────┘

I thought .concat_list might be of use – but the lists are merged into a single flat list:

(df.hstack(read_df.select(pl.all().suffix("_right")))
   .select(pl.concat_list(["a", "a_right"]))
   .limit(1).item())
shape: (8,)
Series: 'a' [list[i64]]
[
    [1]
    [2]
    [3]
    [4]
    [1]
    [2]
    [3]
    [4]
]
Answered By: jqurious