python-polars split string column into many columns by delimiter

Question:

In pandas, the following code will split the string from col1 into many columns. is there a way to do this in polars?

d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df= pd.DataFrame(data=d)
df[["a","b","c","d"]]=df["col1"].str.split('/',expand=True)
Asked By: tommyt

||

Answers:

You can use apply() method

import polars as pl
from polars import col

df = pl.DataFrame({
    'col1': ["a/b/c/d", "e/f/j/k"]
})

print(df)

df:

shape: (2, 1)
┌─────────┐
│ col1    │
│ ---     │
│ str     │
╞═════════╡
│ a/b/c/d │
├╌╌╌╌╌╌╌╌╌┤
│ e/f/j/k │
└─────────┘

With apply()

df = df.with_columns([
    col('col1'),
    *[col('col1').apply(lambda s, i=i: s.split('/')[i]).alias(col_name)
      for i, col_name in enumerate(['a', 'b', 'c', 'd'])]

    # or without 'for'
    # col('col1').apply(lambda s: s.split('/')[0]).alias('a'),
    # col('col1').apply(lambda s: s.split('/')[1]).alias('b'),
    # col('col1').apply(lambda s: s.split('/')[2]).alias('c'),
    # col('col1').apply(lambda s: s.split('/')[3]).alias('d')
])

print(df)

df:

shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1    ┆ a   ┆ b   ┆ c   ┆ d   │
│ ---     ┆ --- ┆ --- ┆ --- ┆ --- │
│ str     ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a   ┆ b   ┆ c   ┆ d   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ e/f/j/k ┆ e   ┆ f   ┆ j   ┆ k   │
└─────────┴─────┴─────┴─────┴─────┘

It works, but probably there is more accurate way)

Answered By: glebcom

With this way you do the string split to turn col1 into a list of strings. Then you loop over the lists and use .arr.get to extract each element into a separate column

(df
.with_column(pl.col("col1").str.split("/"))
.with_columns(
[pl.col("col1").arr.get(i).alias(str(i)) for i in range(len(df[0,"col1"].split('/')))
]
)
)

One challenge is whether you will have the same number of elements in the list in each row. In this solution I’ve assumed you have and have taken the length of the list in the first row to do the loop.

Answered By: braaannigan

Here’s an algorithm that will automatically adjust for the required number of columns — and should be quite performant.

Let’s start with this data. Notice that I’ve purposely added the empty string "" and a null value – to show how the algorithm handles these values. Also, the number of split strings varies widely.

import polars as pl
df = pl.DataFrame(
    {
        "my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
    }
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str                      │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ cat                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog                     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null                        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘

The Algorithm

The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.

(
    df
    .with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_column(
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
    .pivot(
        index=['id', 'my_str'],
        values='split_str',
        columns='col_nm',
    )
    .with_column(
        pl.col('^string_.*$').fill_null("")
    )
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆           ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

How it works

We first assign a row number id (which we’ll need later), and use split to separate the strings. Note that the split strings form a list.

(
    df
    .with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id  ┆ my_str                      ┆ split_str                  │
│ --- ┆ ---                         ┆ ---                        │
│ u32 ┆ str                         ┆ list[str]                  │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0   ┆ cat                         ┆ ["cat"]                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ ["cat", "dog"]             │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆ [""]                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘

Next, we’ll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)

(
    df
    .with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str │
│ --- ┆ ---                         ┆ ---       │
│ u32 ┆ str                         ┆ str       │
╞═════╪═════════════════════════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ dog       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      │
└─────┴─────────────────────────────┴───────────┘

In the next step, we’re going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.

I’ve used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)

You can substitute your own naming in this step as you need.

(
    df
    .with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_column(
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str ┆ col_nm    │
│ --- ┆ ---                         ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ dog       ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘

In the next step, we’ll use the pivot function to place each string in its own column.

(
    df
    .with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_column(
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
    .pivot(
        index=['id', 'my_str'],
        values='split_str',
        columns='col_nm',
    )
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

All that remains is to use fill_null to replace the null values with an empty string "". Notice that I’ve used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)

Answered By: user18559875

You can use struct datatype, as described in this post: https://stackoverflow.com/a/74219166:


import pandas as pl

df = pl.DataFrame({
       "my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
})
df.select(pl.col('my_str').str.split('/')
    .arr.to_struct(n_field_strategy="max_width")).unnest('my_str')

Notice you must use n_field_strategy="max_width", otherwise, unnest() will create only 1 column.

Answered By: waltersantosf
import polars as pl

#Create new column list(can be created dynamically as well)

new_cols=['new_col1','new_col2','new_col3',.....,new_coln]

#Define expression

expr = [pl.col('col1').str.split('/').arr.get(i).alias(col)
        for i,col in enumerate(new_cols)
        ] 
        
#Apply Expression

df.with_columns(expr)
Answered By: amit r
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.