How to flatten/split a tuple of arrays and calculate column means in Polars dataframe?
Question:
I have a dataframe as follows:
df = pl.DataFrame(
{"a": [([1, 2, 3], [2, 3, 4], [6, 7, 8]), ([1, 2, 3], [3, 4, 5], [5, 7, 9])]}
)
Basically, each cell of a is a tuple of three arrays of the same length. I want to fully split them into separate columns (one scalar per column), like the shape below:
shape: (2, 9)
┌─────────┬─────────┬─────────┬─────────┬─────┬─────────┬─────────┬─────────┬─────────┐
│ field_0 ┆ field_1 ┆ field_2 ┆ field_3 ┆ ... ┆ field_5 ┆ field_6 ┆ field_7 ┆ field_8 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╪═════╪═════════╪═════════╪═════════╪═════════╡
│ 1 ┆ 2 ┆ 3 ┆ 2 ┆ ... ┆ 4 ┆ 6 ┆ 7 ┆ 8 │
│ 1 ┆ 2 ┆ 3 ┆ 3 ┆ ... ┆ 5 ┆ 5 ┆ 7 ┆ 9 │
└─────────┴─────────┴─────────┴─────────┴─────┴─────────┴─────────┴─────────┴─────────┘
One way I have tried is to use arr.to_struct and unnest twice to fully flatten the two nested levels. Two levels is fine here, but if the nesting depth varies and cannot be determined ahead of time, the code gets very long.
Is there any simpler (or more systematic) way to achieve this?
Answers:
Not sure if this is a good idea, but I used _string_repr() and counted how many times list appears in the schema, then flattened that many times minus 1 (with a reshape at the end to retain the original number of rows). This also works with more deeply nested lists.
def append_arr_explode(expr: pl.Expr, n: int) -> pl.Expr:
    for _ in range(n):
        expr = expr.arr.explode()
    return expr
print(
    df.lazy()
    .select(
        pl.col("a")
        .pipe(append_arr_explode, df.schema["a"]._string_repr().count("list") - 1)
        .reshape((df.height, -1))
        .arr.to_struct()
    )
    .unnest("a")
    .collect()
)
Here is a solution. You can chain .arr.explode().arr.explode()... until you have unpacked all levels of the nested list.
df.select(
    # add one .arr.explode() inside the eval per extra
    # nesting level (n - 1 in total)
    pl.col("a")
    .arr.eval(pl.element().arr.explode())
    .arr.get(i)
    .alias(f"field_{i}")
    for i in range(9)
)
Piggybacking on some of the concepts in the other answers…
We can see how much nesting there is by checking the inner datatype of the pl.List datatype for the column. This can serve as the condition for a loop. Here's an example with one more level of list nesting than the initial question:
df = pl.DataFrame(
    {
        "a": [
            ([[1, 2], [2, 4], [3, 6]], [[2, 4], [3, 6], [4, 8]], [[6, 12], [7, 14], [8, 16]]),
            ([[1, 2], [2, 4], [3, 6]], [[3, 6], [4, 8], [5, 10]], [[5, 10], [7, 14], [9, 18]]),
        ]
    }
)
shape: (2, 1)
┌─────────────────────────────────────┐
│ a │
│ --- │
│ list[list[list[i64]]] │
╞═════════════════════════════════════╡
│ [[[1, 2], [2, 4], [3, 6]], [[2, ... │
│ [[[1, 2], [2, 4], [3, 6]], [[3, ... │
└─────────────────────────────────────┘
expr = pl.col('a')
dtype = df.schema['a']

# alternative: dtype.base_type() == pl.List
while isinstance(dtype.inner, pl.List):
    expr = expr.arr.eval(pl.element().arr.explode())
    dtype = dtype.inner

df.lazy().select(expr.arr.to_struct()).collect().unnest('a')
┌─────────┬─────────┬─────────┬─────────┬─────┬──────────┬──────────┬──────────┬──────────┐
│ field_0 ┆ field_1 ┆ field_2 ┆ field_3 ┆ ... ┆ field_14 ┆ field_15 ┆ field_16 ┆ field_17 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╪═════╪══════════╪══════════╪══════════╪══════════╡
│ 1 ┆ 2 ┆ 2 ┆ 4 ┆ ... ┆ 7 ┆ 14 ┆ 8 ┆ 16 │
│ 1 ┆ 2 ┆ 2 ┆ 4 ┆ ... ┆ 7 ┆ 14 ┆ 9 ┆ 18 │
└─────────┴─────────┴─────────┴─────────┴─────┴──────────┴──────────┴──────────┴──────────┘