Pandas Dataframe: Get and Edit all values in a column containing substring
Question:
Let's say I have a dataframe, called stores, like this one:
country | store_name |
---|---|
FR | my new tmp |
ES | this Tmp is new |
FR | walmart |
ES | Target |
FR | TMP |
and another dataframe, called replacements, like this one:
country | original | replacement |
---|---|---|
ES | TMP | STORE |
FR | TMP | STORE |
FR | WALMART | IGNORE |
How would you go about getting and updating all values in the store_name column of the first dataframe according to the "rules" of the second one, when the substring in the original column is found (ignoring lower/upper case)?
For this example I'd like to get a new dataframe like this:
country | store_name |
---|---|
FR | my new STORE |
ES | this STORE is new |
FR | IGNORE |
ES | Target |
FR | STORE |
I was thinking of something like iterating over the second dataframe and applying each change to the first one, like this:
for index, row in replacements.iterrows():
    stores['store_name'] = stores['store_name'].str.upper().replace(row["original"].upper(), row["replacement"])
It kind of works, but it’s doing some weird things like not changing some strings. Also, I’m not sure if this is the optimal way of doing this. Any suggestions?
Reproducible inputs (here stores is df1, replacements is df2, and the original column is named store_name):
import pandas as pd

data = [['FR', 'my new tmp'], ['ES', 'this Tmp is new'], ['FR', 'walmart'], ['ES', 'Target'], ['FR', 'TMP']]
df1 = pd.DataFrame(data, columns=['country', 'store_name'])
data = [['ES', 'TMP', 'STORE'], ['FR', 'TMP', 'STORE'], ['FR', 'WALMART', 'IGNORE']]
df2 = pd.DataFrame(data, columns=['country', 'store_name', 'replacement'])
Answers:
Assuming df1 and df2 as defined in the reproducible inputs, you can use a crafted regex within groupby.apply:
import re

# Lookup Series keyed by (country, store_name)
s = df2.set_index(['country', 'store_name'])['replacement']
df1['store_name'] = (
    df1.groupby('country', group_keys=False)
       .apply(lambda g: g['store_name'].str.replace(
           f"({'|'.join(map(re.escape, s[g.name].index))})",  # e.g. (TMP|WALMART) for FR
           lambda m: s[(g.name, m.group().upper())],  # replacement for the matched text
           regex=True, flags=re.I))
)
print(df1)
Output:
country store_name
0 FR my new STORE
1 ES this STORE is new
2 FR IGNORE
3 ES Target
4 FR STORE
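To make the one-liner above easier to follow, here is a minimal sketch that unpacks it for a single country. The names fr_pattern, fr_names and fr_fixed are illustrative only and are not part of the answer. It assumes, as in the sample data, that the store_name keys in df2 are stored in upper case, since the lookup upper-cases the matched text before indexing into s.

```python
import re
import pandas as pd

df2 = pd.DataFrame(
    [['ES', 'TMP', 'STORE'], ['FR', 'TMP', 'STORE'], ['FR', 'WALMART', 'IGNORE']],
    columns=['country', 'store_name', 'replacement'],
)

# Lookup Series keyed by (country, store_name)
s = df2.set_index(['country', 'store_name'])['replacement']

# The alternation pattern the lambda builds for the 'FR' group
fr_pattern = f"({'|'.join(map(re.escape, s['FR'].index))})"
print(fr_pattern)  # (TMP|WALMART)

# Hypothetical FR store names, taken from the question's sample data
fr_names = pd.Series(['my new tmp', 'walmart', 'TMP'])

# Case-insensitive replacement; the callable looks up each match in s
fr_fixed = fr_names.str.replace(
    fr_pattern,
    lambda m: s[('FR', m.group().upper())],
    regex=True,
    flags=re.I,
)
print(fr_fixed.tolist())  # ['my new STORE', 'IGNORE', 'STORE']
```

If the keys in df2 were mixed case, the lookup would presumably need to normalise both sides (for example by upper-casing the index when building s).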
If getting a new dataframe as the result is acceptable, consider the following approach: outer-join the two dataframes on country, group by country and store name, and apply a case-insensitive regex replacement within each group, returning as soon as one rule matches:
import re

def f(x):
    # x holds every rule of one country paired with one store name
    for r in x.itertuples(index=False):
        store_name, subs = re.subn(re.escape(r.store_name_y), r.replacement, r.store_name_x, flags=re.I)
        if subs > 0:  # if there was a successful replacement
            return store_name  # return the result immediately
    else:  # for/else: no rule matched, keep the original name
        return r.store_name_x

res_df = (
    df1.merge(df2, on='country', how='outer')
       .groupby(['country', 'store_name_x'], sort=False)
       .apply(f).droplevel(1).reset_index(name='store_name')
)
print(res_df)
Output:
country store_name
0 FR my new STORE
1 FR IGNORE
2 FR STORE
3 ES this STORE is new
4 ES Target
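For comparison, the loop from the question can also be made to work. It misbehaves because Series.replace matches whole cell values by default rather than substrings, because .str.upper() permanently upper-cases every name, and because it never checks the country. The following is a rough sketch of a corrected loop, not taken from either answer; the names result, rule and in_country are illustrative, and it assumes the originals are literal substrings (hence re.escape).

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [['FR', 'my new tmp'], ['ES', 'this Tmp is new'], ['FR', 'walmart'],
     ['ES', 'Target'], ['FR', 'TMP']],
    columns=['country', 'store_name'],
)
df2 = pd.DataFrame(
    [['ES', 'TMP', 'STORE'], ['FR', 'TMP', 'STORE'], ['FR', 'WALMART', 'IGNORE']],
    columns=['country', 'store_name', 'replacement'],
)

result = df1.copy()  # leave the original dataframe untouched
for rule in df2.itertuples(index=False):
    # Only rows from the rule's country are eligible
    in_country = result['country'] == rule.country
    # Substring replacement, ignoring case; other text keeps its casing
    result.loc[in_country, 'store_name'] = (
        result.loc[in_country, 'store_name']
        .str.replace(re.escape(rule.store_name), rule.replacement,
                     case=False, regex=True)
    )

print(result['store_name'].tolist())
# ['my new STORE', 'this STORE is new', 'IGNORE', 'Target', 'STORE']
```

Rules are applied one after another here, so a later rule could in principle match text inserted by an earlier one. With many rules per country, the single-regex approach from the first answer avoids that and makes one pass per group.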