Polars Create Column with String Formatting

Question:

I have a polars dataframe:

import polars as pl

df = pl.DataFrame({'schema_name': ['test_schema', 'test_schema_2'],
                   'table_name': ['test_table', 'test_table_2'],
                   'column_name': ['test_column, test_column_2', 'test_column']})

schema_name    table_name    column_name
test_schema    test_table    test_column, test_column_2
test_schema_2  test_table_2  test_column

I have a string:

date_field_value_max_query = '''
    select '{0}' as schema_name, 
           '{1}' as table_name, 
           greatest({2})
    from {0}.{1}
    group by 1, 2
'''

I would like to use polars to add a column by using string formatting. The target dataframe is this:

schema_name    table_name    column_name                 query
test_schema    test_table    test_column, test_column_2  select test_schema, test_table, greatest(test_column, test_column_2) from test_schema.test_table group by 1, 2
test_schema_2  test_table_2  test_column                 select test_schema_2, test_table_2, greatest(test_column) from test_schema_2.test_table_2 group by 1, 2

In pandas, I would do something like this:

df.apply(lambda row: date_field_value_max_query.format(row['schema_name'], row['table_name'], row['column_name']), axis=1)

For polars, I tried this:

df.with_column(
    (date_field_value_max_query.format(pl.col('schema_name'), pl.col('table_name'), pl.col('column_name')))
)

This doesn’t work: str.format returns an ordinary Python string rather than a polars expression, and with_column expects an expression. I am able to get the output I want by doing this…

df.apply(lambda row: date_field_value_max_query.format(row[0], row[1], row[2]))

…but this returns only the new column, and I lose the original three. I also know that row-wise apply is discouraged in polars whenever an expression can do the job.

How can I perform string formatting across multiple dataframe columns with the output column attached to the original dataframe?

Answers:

Couldn’t you just:

df['query'] = df.apply(lambda row: date_field_value_max_query.format(row[0], row[1], row[2]))

Am I missing something?

–K

Answered By: Kibeth

[UPDATE]: I was not aware of pl.format(). @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ’s answer should be accepted.

df.apply() returns a dataframe – you can stack dataframes horizontally:

df.hstack(
   df.apply(lambda row: date_field_value_max_query.format(*row))
     .rename({"apply": "query"})
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘

You can add Series objects as columns:

df.with_columns(
   (df.apply(lambda row: date_field_value_max_query.format(*row))
      .to_series().rename("query"))
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘

If you are creating the format string – you could perhaps use an expression instead:

# Note the opening quote before table_name and the spaces between clauses.
df.with_columns((
   pl.lit("select '") + pl.col("schema_name") + pl.lit("' as schema_name, '")
                      + pl.col("table_name")  + pl.lit("' as table_name, ")
                      + pl.lit("greatest(")   + pl.col("column_name") + pl.lit(") ") +
   pl.lit("from ")    + pl.col("schema_name") + pl.lit(".") + pl.col("table_name")  +
   pl.lit(" group by 1, 2")
).alias("query"))
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘
Answered By: jqurious

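Another option is to use polars.format to create your string. It takes empty {} placeholders rather than numbered ones, so a column that appears twice in the template is passed twice. For example: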

date_field_value_max_query = (
'''select {} as schema_name,
       {} as table_name,
       greatest({})
    from {}.{}
    group by 1, 2
'''
)

(
    df
    .with_columns([
        pl.format(date_field_value_max_query,
                  'schema_name',
                  'table_name',
                  'column_name',
                  'schema_name',
                  'table_name')
    ])
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ literal                                     │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                         │
│ str           ┆ str          ┆ str                        ┆ str                                         │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select test_schema as schema_name,          │
│               ┆              ┆                            ┆        test_table as table_name,            │
│               ┆              ┆                            ┆        greatest(test_column, test_column_2) │
│               ┆              ┆                            ┆     from test_schema.test_table             │
│               ┆              ┆                            ┆     group by 1, 2                           │
│               ┆              ┆                            ┆                                             │
│               ┆              ┆                            ┆                                             │
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select test_schema_2 as schema_name,        │
│               ┆              ┆                            ┆        test_table_2 as table_name,          │
│               ┆              ┆                            ┆        greatest(test_column)                │
│               ┆              ┆                            ┆     from test_schema_2.test_table_2         │
│               ┆              ┆                            ┆     group by 1, 2                           │
│               ┆              ┆                            ┆                                             │
│               ┆              ┆                            ┆                                             │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────────────┘