How to merge two dfs based on a substring of the strings in a column and insert values of another column?
Question:
I have the following dfs:
data:
ZIP code | urbanisation |
---|---|
1111AA | |
3916HV | |
reference:
ZIP code category | urbanisation |
---|---|
1111 | High |
3916 | Medium |
So the urbanisation in my data set is empty and I need to fill it using a measure of urbanisation I found online. I want to:
- Match column reference["ZIP code category"] with the first 4 digits of data["ZIP code"], but I cannot change the actual ZIP codes. That is, I want to match based on a substring, for example by using data["ZIP code"].str[:4].
- For every match, paste the corresponding value of reference["urbanisation"] into data["urbanisation"].
I tried this as follows:
pd.merge(
data, reference,
left_on=['ZIP code', data["ZIP code"].str[:4]],
right_on=['ZIP code category', reference["ZIP code category"]]
)
However, this code is not correct and I do not know how to produce the desired result.
Answers:
You can build a helper key from the first 4 characters of ZIP code, convert ZIP code category to strings (if necessary) so both keys have the same dtype, and use a left join:
df = pd.merge(data.drop('urbanisation', axis=1), reference,
              left_on=data["ZIP code"].str[:4],
              right_on=reference["ZIP code category"].astype(str),
              how='left'  # keep every row of data, even without a match
              ).drop(['key_0', 'ZIP code category'], axis=1, errors='ignore')
print(df)
ZIP code urbanisation
0 1111AA High
1 3916HV Medium
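For anyone who wants to run the merge approach end to end, here is a minimal self-contained sketch. The frames are rebuilt from the question; the extra row 9999ZZ and the temporary column name zip4 are my own additions for illustration, not part of the original data. It assumes ZIP code category is stored as integers, hence the astype(str).

```python
import pandas as pd

# Frames rebuilt from the question; "9999ZZ" is a made-up row with no match.
data = pd.DataFrame({"ZIP code": ["1111AA", "3916HV", "9999ZZ"],
                     "urbanisation": [None, None, None]})
reference = pd.DataFrame({"ZIP code category": [1111, 3916],
                          "urbanisation": ["High", "Medium"]})

# Temporary helper key (hypothetical name); "ZIP code" itself is not modified.
left = data.drop(columns="urbanisation").assign(zip4=data["ZIP code"].str[:4])
right = reference.assign(zip4=reference["ZIP code category"].astype(str))

# Left join keeps every row of data; unmatched rows get NaN.
merged = (
    left.merge(right[["zip4", "urbanisation"]], on="zip4", how="left")
        .drop(columns="zip4")
)
print(merged)
```

With how="left" the unmatched ZIP code stays in the result with a missing urbanisation, whereas the default inner join would drop that row.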
You can use map:
data['urbanisation'] = data['ZIP code'].str[:4].map(
reference.astype({'ZIP code category': str})
.set_index('ZIP code category')['urbanisation'])
print(data)
# Output
ZIP code urbanisation
0 1111AA High
1 3916HV Medium
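One caveat with the map approach, as far as I know: the lookup Series needs a unique index, so a reference table that lists the same category twice should be deduplicated first. A small sketch under that assumption (the repeated 3916 row and the names lookup and key are invented for the example):

```python
import pandas as pd

data = pd.DataFrame({"ZIP code": ["1111AA", "3916HV"],
                     "urbanisation": [None, None]})
# Hypothetical reference table in which category 3916 appears twice.
reference = pd.DataFrame({"ZIP code category": [1111, 3916, 3916],
                          "urbanisation": ["High", "Medium", "Medium"]})

# Build a lookup Series: string key -> urbanisation, one entry per category.
lookup = (
    reference.drop_duplicates("ZIP code category")
             .assign(key=lambda d: d["ZIP code category"].astype(str))
             .set_index("key")["urbanisation"]
)

# Map the first 4 characters of each ZIP code through the lookup.
data["urbanisation"] = data["ZIP code"].str[:4].map(lookup)
print(data)
```

Because map only assigns to the urbanisation column, the ZIP code column is left exactly as it was.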
Another possible solution, which is based on pandas.DataFrame.update:
# df1 is data, df2 is reference
# \D matches every non-digit, so '1111AA' becomes '1111'
out = df1.set_index(df1['ZIP code'].str.replace(r'\D', '', regex=True))
out.update(df2.set_index(df2['ZIP code category'].astype(str)))
out = out.reset_index(drop=True)
Output:
ZIP code urbanisation
0 1111AA High
1 3916HV Medium
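Note that the answers derive the join key in different ways: slicing the first 4 characters versus stripping every non-digit. On the ZIP codes in the question they agree, but they can differ on other inputs. A quick comparison on one made-up value ("1111A2" is hypothetical, as is the str.extract variant, which none of the answers use):

```python
import pandas as pd

zips = pd.Series(["1111AA", "1111A2"])  # second value is invented

# First 4 characters, whatever they are.
by_slice = zips.str[:4]
# Remove every non-digit character, keeping all digits.
by_strip = zips.str.replace(r"\D", "", regex=True)
# Exactly 4 leading digits, NaN if the string does not start with them.
by_extract = zips.str.extract(r"^(\d{4})", expand=False)

print(pd.DataFrame({"slice": by_slice, "strip": by_strip, "extract": by_extract}))
```

If a suffix can contain digits, stripping non-digits produces a longer key ("11112" here) that will not match the 4-digit category, so slicing or extracting is the safer choice in that case.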