Polars Create Column with String Formatting
Question:
I have a polars dataframe:
df = pl.DataFrame({'schema_name': ['test_schema', 'test_schema_2'],
'table_name': ['test_table', 'test_table_2'],
'column_name': ['test_column, test_column_2','test_column']})
schema_name | table_name | column_name |
---|---|---|
test_schema | test_table | test_column, test_column_2 |
test_schema_2 | test_table_2 | test_column |
I have a string:
date_field_value_max_query = '''
select '{0}' as schema_name,
'{1}' as table_name,
greatest({2})
from {0}.{1}
group by 1, 2
'''
I would like to use polars to add a column by using string formatting. The target dataframe is this:
schema_name | table_name | column_name | query |
---|---|---|---|
test_schema | test_table | test_column, test_column_2 | select test_schema, test_table, greatest(test_column, test_column_2) from test_schema.test_table group by 1, 2 |
test_schema_2 | test_table_2 | test_column | select test_schema_2, test_table_2, greatest(test_column) from test_schema_2.test_table_2 group by 1, 2 |
In pandas, I would do something like this:
df.apply(lambda row: date_field_value_max_query.format(row['schema_name'], row['table_name'], row['column_name']), axis=1)
For polars, I tried this:
df.with_column(
(date_field_value_max_query.format(pl.col('schema_name'), pl.col('table_name'), pl.col('column_name')))
)
This doesn’t work, because with_column expects a single expression. I am able to get the output I want by doing this…
df.apply(lambda row: date_field_value_max_query.format(row[0], row[1], row[2]))
…but this returns only the one column, and I lose the original three columns. I know this approach is also not recommended for polars, when possible.
How can I perform string formatting across multiple dataframe columns with the output column attached to the original dataframe?
Answers:
Couldn’t you just:
df['query'] = df.apply(lambda row: date_field_value_max_query.format(row[0], row[1], row[2]))
Am I missing something?
–K
[UPDATE]: I was not aware of pl.format(); @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ’s answer should be accepted.
df.apply() returns a dataframe, so you can stack dataframes horizontally:
df.hstack(
df.apply(lambda row: date_field_value_max_query.format(*row))
.rename({"apply": "query"})
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘
You can add Series objects as columns:
df.with_columns(
(df.apply(lambda row: date_field_value_max_query.format(*row))
.to_series().rename("query"))
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘
If you are creating the format string – you could perhaps use an expression instead:
df.with_columns((
    pl.lit("select '") + pl.col("schema_name") + pl.lit("' as schema_name, '")
    + pl.col("table_name") + pl.lit("' as table_name, ")
    + pl.lit("greatest(") + pl.col("column_name") + pl.lit(") ")
    + pl.lit("from ") + pl.col("schema_name") + pl.lit(".") + pl.col("table_name")
    + pl.lit(" group by 1, 2")
).alias("query"))
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name                ┆ query                               │
│ ---           ┆ ---          ┆ ---                        ┆ ---                                 │
│ str           ┆ str          ┆ str                        ┆ str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   ┆ test_table   ┆ test_column, test_column_2 ┆ select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 ┆ test_table_2 ┆ test_column                ┆ select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘
Another option is to use polars.format to create your string. For example:
date_field_value_max_query = (
'''select {} as schema_name,
{} as table_name,
greatest({})
from {}.{}
group by 1, 2
'''
)
(
df
.with_columns([
pl.format(date_field_value_max_query,
'schema_name',
'table_name',
'column_name',
'schema_name',
'table_name')
])
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────────────┐
│ schema_name ┆ table_name ┆ column_name ┆ literal │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════════════╡
│ test_schema ┆ test_table ┆ test_column, test_column_2 ┆ select test_schema as schema_name, │
│ ┆ ┆ ┆ test_table as table_name, │
│ ┆ ┆ ┆ greatest(test_column, test_column_2) │
│ ┆ ┆ ┆ from test_schema.test_table │
│ ┆ ┆ ┆ group by 1, 2 │
│ ┆ ┆ ┆ │
│ ┆ ┆ ┆ │
│ test_schema_2 ┆ test_table_2 ┆ test_column ┆ select test_schema_2 as schema_name, │
│ ┆ ┆ ┆ test_table_2 as table_name, │
│ ┆ ┆ ┆ greatest(test_column) │
│ ┆ ┆ ┆ from test_schema_2.test_table_2 │
│ ┆ ┆ ┆ group by 1, 2 │
│ ┆ ┆ ┆ │
│ ┆ ┆ ┆ │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────────────┘