Best way to look up a value with pandas based on multiple column values

Question:

df1:
  UUID Street  Number City Munic
0          S1       1   C1    M1
1          S2      2A   C2    M2
2          S3       3   C3    M3
3          S3       3   C3    M8
4          S1       1   C1    M1

lkp:
  UUID Street  Number City Munic
0   U1     S1       1   C1    M1
1   U2     S2      2A   C2    M2
2   U3     S3       3   C3    M3
:
:

Hi,
I have two dataframes as above. lkp contains more than 300,000 rows, and each row is unique.
df1 can have anywhere between 1k and 50k rows, and it can contain duplicate rows.

From df1 I need to take Street, Number, City and Munic and check whether the same "row" (i.e. the same values) exists in the lkp dataframe. If so, update df1’s UUID value with the corresponding UUID value from lkp.
If there is no match in lkp, write the row to a "missing" file (or dataframe).
So the above example should result in the following:


  UUID Street  Number City Munic
0   U1     S1       1   C1    M1
1   U2     S2      2A   C2    M2
2   U3     S3       3   C3    M3
4   U1     S1       1   C1    M1

and one line in the "nomatch" file:
S3 3 C3 M8

I’ve managed to do this by:
– looping over each row in df1 using .iterrows() (I know it’s not the most optimal way)
– looking up the df1 "row" in lkp where Street, Number, City and Munic are equal, and getting the UUID value
– if a match is found, storing the UUID value + df1 row in a CSV file; otherwise writing the df1 row to a nomatch file
– reading the CSV file into a new dataframe

but my solution is not very "pythonic" and it takes a long time to run (4+ min for a df1 with 4k rows).
Can anyone suggest a faster solution?

Regards

Asked By: HoBe


Answers:

Code

Merge, then check for NaN:

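# Left-merge on the shared columns (Street, Number, City, Munic); unmatched rows get NaN in UUID
tmp = df1.drop('UUID', axis=1).merge(lkp, how='left')
# True where no match was found in lkp
cond1 = tmp['UUID'].isna()
# Matched rows, with columns in df1's original order
out = tmp.loc[~cond1, df1.columns]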

out:

    UUID    Street  Number  City    Munic
0   U1      S1      1       C1      M1
1   U2      S2      2A      C2      M2
2   U3      S3      3       C3      M3
4   U1      S1      1       C1      M1

# Rows of df1 without a match (relies on df1 having a default 0..n-1 index, like tmp)
nomatch = df1[cond1]

nomatch:

    UUID    Street  Number  City    Munic
3   NaN     S3      3       C3      M8
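
If df1 does not have a default 0..n-1 index, the mask from the merged frame will not line up with df1. A possible variant, only a sketch under the assumption that the key columns are exactly Street, Number, City and Munic, carries df1's index through the merge and uses merge's indicator=True flag (which adds a "_merge" column) instead of testing for NaN. The names keys, merged and found are illustrative, not from the answer above.

```python
import pandas as pd

# Same sample data as in the answer's example.
df1 = pd.DataFrame({
    "UUID": [float("nan")] * 5,
    "Street": ["S1", "S2", "S3", "S3", "S1"],
    "Number": ["1", "2A", "3", "3", "1"],
    "City": ["C1", "C2", "C3", "C3", "C1"],
    "Munic": ["M1", "M2", "M3", "M8", "M1"],
})
lkp = pd.DataFrame({
    "UUID": ["U1", "U2", "U3"],
    "Street": ["S1", "S2", "S3"],
    "Number": ["1", "2A", "3"],
    "City": ["C1", "C2", "C3"],
    "Munic": ["M1", "M2", "M3"],
})

# Assumption: these four columns together identify an address.
keys = ["Street", "Number", "City", "Munic"]

# reset_index() stores df1's original labels in a column named "index",
# so they survive the merge, which otherwise builds a fresh 0..n-1 index.
merged = (
    df1.drop(columns="UUID")
       .reset_index()
       .merge(lkp, on=keys, how="left", indicator=True)
       .set_index("index")
)
merged.index.name = None

# "_merge" is "both" where lkp had a matching row, "left_only" otherwise.
found = merged["_merge"] == "both"
out = merged.loc[found, df1.columns]
nomatch = merged.loc[~found, keys]
```

To get the "nomatch" file the question asks for, nomatch.to_csv("nomatch.csv", index=False) should do; the file name is just a placeholder.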

Example Code

import pandas as pd
nan = float('nan')
data1 = {'UUID': [nan, nan, nan, nan, nan], 'Street': ['S1', 'S2', 'S3', 'S3', 'S1'], 
         'Number': ['1', '2A', '3', '3', '1'], 'City': ['C1', 'C2', 'C3', 'C3', 'C1'], 
         'Munic': ['M1', 'M2', 'M3', 'M8', 'M1']}

data2 = {'UUID': ['U1', 'U2', 'U3'], 'Street': ['S1', 'S2', 'S3'], 
         'Number': ['1', '2A', '3'], 'City': ['C1', 'C2', 'C3'], 
         'Munic': ['M1', 'M2', 'M3']}

df1 = pd.DataFrame(data1)
lkp = pd.DataFrame(data2)
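
The merge approach depends on each address appearing only once in lkp, as the question states. If that ever stops being true, a left merge returns more rows than df1 has, and the mask no longer matches df1. One way to guard against this, shown here as a sketch with a deliberately broken, made-up lookup table (bad_lkp), is merge's validate="many_to_one" option, which raises pandas.errors.MergeError when the right-hand keys are not unique.

```python
import pandas as pd

keys = ["Street", "Number", "City", "Munic"]

df1 = pd.DataFrame({
    "Street": ["S1", "S1"],
    "Number": ["1", "1"],
    "City": ["C1", "C1"],
    "Munic": ["M1", "M1"],
})

# Hypothetical lookup table in which the same address carries two UUIDs.
bad_lkp = pd.DataFrame({
    "UUID": ["U1", "U9"],
    "Street": ["S1", "S1"],
    "Number": ["1", "1"],
    "City": ["C1", "C1"],
    "Munic": ["M1", "M1"],
})

try:
    # many_to_one: duplicates are allowed in df1, but not in the lookup table.
    df1.merge(bad_lkp, on=keys, how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
```

With the real lkp, the same validate argument can simply be added to the merge call; it costs an extra uniqueness check but fails loudly instead of silently duplicating rows.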
Answered By: Panda Kim