How to merge two dfs based on a substring of the strings in a column and insert values of another column?

Question:

I have the following dfs:

data:

ZIP code   urbanisation
1111AA
3916HV

reference:

ZIP code category   urbanisation
1111                High
3916                Medium

So the urbanisation in my data set is empty and I need to fill it using a measure of urbanisation I found online. I want to:

  • Match column reference["ZIP code category"] with the first 4 digits of data["ZIP code"], but I cannot change the actual ZIP codes. That is, I want to match based on a substring, for example by using data["ZIP code"].str[:4].
  • For every match paste the corresponding value of reference["urbanisation"] in data["urbanisation"]

I tried this as follows:

pd.merge(
    data, reference,
    left_on=['ZIP code', data["ZIP code"].str[:4]],
    right_on=['ZIP code category', reference["ZIP code category"]]
)

However, this code is not correct and I do not know how to produce the desired result.

Asked By: Xtiaan

Answers:

You can merge on a helper key built from the first 4 characters of ZIP code, convert ZIP code category to strings (if necessary), and use a left join so that every row of data is kept:

df = pd.merge(data.drop('urbanisation', axis=1), reference,
              left_on=data["ZIP code"].str[:4],
              right_on=reference["ZIP code category"].astype(str),
              how='left'
              ).drop(['key_0', 'ZIP code category'], axis=1, errors='ignore')
print (df)
  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium
Answered By: jezrael
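
The same idea can be written with an explicit helper column, which some readers find easier to follow than passing Series as merge keys. This is only a sketch: the frames are rebuilt from the question's sample, the row 9999ZZ and the column name zip4 are hypothetical additions for illustration, and validate="many_to_one" assumes each 4-digit category appears only once in reference.

```python
import pandas as pd

# Sample frames modelled on the question; "9999ZZ" is a made-up unmatched row.
data = pd.DataFrame({"ZIP code": ["1111AA", "3916HV", "9999ZZ"],
                     "urbanisation": [None, None, None]})
reference = pd.DataFrame({"ZIP code category": [1111, 3916],
                          "urbanisation": ["High", "Medium"]})

merged = (
    data.drop(columns="urbanisation")
        # Hypothetical helper column holding the first 4 characters.
        .assign(zip4=lambda d: d["ZIP code"].str[:4])
        .merge(
            # Cast the integer categories to strings so the key dtypes match.
            reference.astype({"ZIP code category": str}),
            left_on="zip4",
            right_on="ZIP code category",
            how="left",              # keep every row of data, matched or not
            validate="many_to_one",  # raise if a category is duplicated in reference
        )
        .drop(columns=["zip4", "ZIP code category"])
)
print(merged)
```

With how="left" the unmatched row stays in the result with a missing urbanisation value, and the original ZIP codes are left unchanged.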

You can use Series.map with a lookup Series indexed by ZIP code category:

data['urbanisation'] = data['ZIP code'].str[:4].map(
                           reference.astype({'ZIP code category': str})
                                    .set_index('ZIP code category')['urbanisation'])
print(data)

# Output
  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium
Answered By: Corralien
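
To see which ZIP codes found no match with the map approach, one option is to check for missing values afterwards. A minimal sketch, assuming the question's sample data plus one hypothetical unmatched row (9999ZZ), and assuming each category appears only once in reference; the names lookup and unmatched are illustrative.

```python
import pandas as pd

# Sample frames modelled on the question; "9999ZZ" is a made-up unmatched row.
data = pd.DataFrame({"ZIP code": ["1111AA", "3916HV", "9999ZZ"]})
reference = pd.DataFrame({"ZIP code category": [1111, 3916],
                          "urbanisation": ["High", "Medium"]})

# Lookup Series: string category -> urbanisation label.
lookup = reference.set_index(
    reference["ZIP code category"].astype(str)
)["urbanisation"]

# Map the 4-character prefix of each ZIP code through the lookup.
data["urbanisation"] = data["ZIP code"].str[:4].map(lookup)

# Prefixes absent from the lookup come back as missing values.
unmatched = data.loc[data["urbanisation"].isna(), "ZIP code"].tolist()
print(data)
print(unmatched)
```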

Another possible solution, which is based on pandas.DataFrame.update:

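# Index data by the digits of the ZIP code (\D matches any non-digit character)
out = data.set_index(data['ZIP code'].str.replace(r'\D', '', regex=True))
# Fill urbanisation from the reference rows that share the same index label
out.update(reference.set_index(reference['ZIP code category'].astype(str)))
out = out.reset_index(drop=True)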

Output:

  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium
Answered By: PaulS