Python Polars – how to replace strings in a df column with lists with values from dictionary?
Question:
This is a follow-up to a question that was previously answered.
I have a large dataframe df that looks like this (a list in column ‘SKU’):
| SKU | Count | Percent |
|----------------------------------------------------------------------|-------|-------------|
| "('000000009100000749',)" | 110 | 0.029633621 |
| "('000000009100000749', '000000009100000776')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000776', '000000009100002260')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002260')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002530')" | 1 | 0.000269397 |
I need to replace the values in the ‘SKU’ column with the corresponding values from a dictionary df_unique that looks like this (please ignore the format below, it is a dict):
| skus (str) | code (i64) |
|---|---|
| 000000009100000749 | 1 |
| 000000009100000785 | 2 |
| 000000009100002088 | 3 |
I have tried this code:
replacements = pl.col("SKU")
for old, new in df_unique.items():
    replacements = replacements.str.replace_all(old, new)
df = df.select(replacements)
Get this error:
SchemaError: Series of dtype: List(Utf8) != Utf8
I have tried to change the column values to strings, although I think it is redundant, but I get the same error:
df = df.with_column(
    pl.col('SKU').apply(lambda row: [str(x) for x in row])
)
Any guidance on what I am doing wrong?
Answers:
Column SKU has list[str] dtype, but you then call the .str attribute (here: replacements.str.replace_all(old, new)), which is only for string columns. With columns that have a list dtype you should use the .arr attribute and its corresponding methods.
You can use the solution below with .apply(), or the solution by jqurious, which works much faster (because .arr.eval() allows all the expressions to run in parallel).
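import polars as pl

# Mapping from SKU string to integer code
d = {"000000009100000749": 1, "000000009100000776": 2}
df = pl.DataFrame({
    "SKU": [["000000009100000749", "000000009100000776"]]
})
# Look up every element of each list in the dictionary
# (with_column and apply are the method names used in the question)
df = df.with_column(
    pl.col("SKU").apply(
        lambda row: [d[i] for i in row]
    ).alias("SKU_replaced")
)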
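One caveat with the lookup above: d[i] raises a KeyError for any SKU that is missing from the dictionary, and the question's table contains SKUs (for example 000000009100002260) that are not in the sample mapping. A minimal sketch of a per-row function that keeps unknown SKUs unchanged is below. The names replace_row and sku_to_code are hypothetical, and converting the codes to strings is an assumption made so that every element of the resulting list has the same type.

```python
# Hypothetical mapping and helper, named here only for illustration.
sku_to_code = {"000000009100000749": 1, "000000009100000776": 2}


def replace_row(row, mapping):
    """Swap each SKU for its code (as a string); keep SKUs with no mapping as-is."""
    return [str(mapping[sku]) if sku in mapping else sku for sku in row]


row = ["000000009100000749", "000000009100002260"]
print(replace_row(row, sku_to_code))  # ['1', '000000009100002260']
```

A function like this could be passed to .apply() in place of the lambda, e.g. lambda row: replace_row(row, d).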
It would help if you showed the actual list type of the column:
It looks like you have "stringified" tuples but it’s not entirely clear.
df = pl.DataFrame({
    "SKU": [["000000009100000749"], ["000000009100000749", "000000009100000776"]]
})
sku_to_code = {
    "000000009100000749": 1,
    "000000009100000785": 2,
    "000000009100002088": 3
}
>>> df
shape: (2, 1)
┌─────────────────────────────────────┐
│ SKU │
│ --- │
│ list[str] │
╞═════════════════════════════════════╡
│ ["000000009100000749"] │
├─────────────────────────────────────┤
│ ["000000009100000749", "00000000... │
└─────────────────────────────────────┘
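If the column really does hold stringified tuples, as the table in the question suggests, the cells would need to be parsed into real lists before any list method applies. A rough sketch using the standard library's ast.literal_eval is below. The helper name parse_sku_cell is hypothetical, and the cell format is assumed to be exactly what the question's table shows.

```python
import ast


def parse_sku_cell(cell):
    """Turn a stringified tuple such as "('a', 'b')" into a list of strings."""
    # literal_eval only evaluates Python literals, so it is safer than eval()
    return list(ast.literal_eval(cell))


cells = [
    "('000000009100000749',)",
    "('000000009100000749', '000000009100000776')",
]
parsed = [parse_sku_cell(c) for c in cells]
print(parsed)
```

Because the SKUs are quoted inside the tuple, they stay strings and keep their leading zeros.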
When dealing with list columns, .arr.eval() can be used to run expressions against each element in the list. pl.element() is used to refer to each individual element:
replace_sku = pl.element()
# Chain one literal replacement per dictionary entry
for old, new in sku_to_code.items():
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)
>>> df.select(pl.col("SKU").arr.eval(replace_sku, parallel=True))
shape: (2, 1)
┌─────────────────────────────┐
│ SKU │
│ --- │
│ list[str] │
╞═════════════════════════════╡
│ ["1"] │
├─────────────────────────────┤
│ ["1", "000000009100000776"] │
└─────────────────────────────┘
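To see why the second SKU is left untouched and why the codes come back as strings, here is a plain-Python analogue of the chained literal replacements applied to a single element. It is only an illustration of the semantics, not of how Polars runs them, and the helper name chained_replace and the short_keys mapping are hypothetical.

```python
def chained_replace(value, mapping):
    """Apply each literal replacement in turn, like the chained replace_all calls."""
    for old, new in mapping.items():
        value = value.replace(old, str(new))
    return value


sku_to_code = {
    "000000009100000749": 1,
    "000000009100000785": 2,
    "000000009100002088": 3,
}
print(chained_replace("000000009100000749", sku_to_code))  # 1
print(chained_replace("000000009100000776", sku_to_code))  # no key matches, unchanged

# Replacement works on substrings, so overlapping keys can interfere.
# Hypothetical short keys, chosen only to show the effect:
short_keys = {"74": 1, "749": 2}
print(chained_replace("749", short_keys))  # 19, not 2
```

With fixed-width SKUs like the ones in the question, no key is a substring of another, so the overlap problem should not come up.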
Both solutions from jqurious and glebcom above work perfectly for the question as asked.
I had not realized that df_unique is a list of dictionaries and not a dict, so I had to tweak the solution accordingly. Here is what the slightly modified solution from jqurious looks like (the loop now iterates over the elements in the df_unique list of dicts):
replace_sku = pl.element()
for item in df_unique:
    old = item['SKU']
    new = item['code']
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)
df = df.select(pl.col("SKU").arr.eval(replace_sku, parallel=True))
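An alternative to changing the loop would be to flatten the list of dicts into a single dict first. The sketch below assumes each item has the 'SKU' and 'code' keys used in the loop above; the sample values are taken from the table in the question.

```python
# Assumed shape of df_unique: a list of dicts with 'SKU' and 'code' keys
df_unique = [
    {"SKU": "000000009100000749", "code": 1},
    {"SKU": "000000009100000785", "code": 2},
    {"SKU": "000000009100002088", "code": 3},
]

# Flatten into one SKU -> code mapping
sku_to_code = {item["SKU"]: item["code"] for item in df_unique}
print(sku_to_code)
```

With a flat dict like this, the .items() loop from the earlier answer can be used unchanged.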