Best way to lookup a value with pandas based on multiple column values
Question:
df1:
  UUID Street Number City Munic
0  NaN     S1      1   C1    M1
1  NaN     S2     2A   C2    M2
2  NaN     S3      3   C3    M3
3  NaN     S3      3   C3    M8
4  NaN     S1      1   C1    M1
lkp:
  UUID Street Number City Munic
0   U1     S1      1   C1    M1
1   U2     S2     2A   C2    M2
2   U3     S3      3   C3    M3
:
:
Hi,
I have two dataframes as above, where lkp contains more than 300,000 rows and each row is unique.
df1 can be anything between 1k and 50k rows, and it can contain duplicate rows.
From df1 I need to take Street, Number, City and Munic and check whether the same "row" (i.e. the same values) exists in the lkp dataframe. If so, update df1's UUID value with the corresponding UUID value from lkp.
If no match is found in lkp, the row should be written to a "missing" file (or dataframe).
So the above example should result in the following:
  UUID Street Number City Munic
0   U1     S1      1   C1    M1
1   U2     S2     2A   C2    M2
2   U3     S3      3   C3    M3
4   U1     S1      1   C1    M1
and one line in the "nomatch" file
S3 3 C3 M8
I’ve managed to do this by:
– looping over each row in df1 using .iterrows() (I know it’s not the most optimal way)
– looking up the df1 "row" in lkp where Street, Number, City and Munic are all equal and getting the UUID value
– if a match is found, storing the UUID value + the df1 row in a csv file, otherwise writing the df1 row to a nomatch file
– reading the csv file back into a new dataframe
but my solution is not very "pythonic" and it takes a long time to run (over 4 minutes for a df1 with 4k rows).
Can anyone suggest a faster solution?
Regards
Answers:
Code
Use merge and check for NaN:

tmp = df1.drop('UUID', axis=1).merge(lkp, how='left')  # left join on the four shared key columns
cond1 = tmp['UUID'].isna()                             # True where no match was found in lkp
out = tmp.loc[~cond1, df1.columns]                     # matched rows, with UUID filled in from lkp
out:
  UUID Street Number City Munic
0   U1     S1      1   C1    M1
1   U2     S2     2A   C2    M2
2   U3     S3      3   C3    M3
4   U1     S1      1   C1    M1
nomatch = df1[cond1]  # rows of df1 with no counterpart in lkp
nomatch:
  UUID Street Number City Munic
3  NaN     S3      3   C3    M8
Example Code
import pandas as pd
nan = float('nan')
data1 = {'UUID': [nan, nan, nan, nan, nan], 'Street': ['S1', 'S2', 'S3', 'S3', 'S1'],
'Number': ['1', '2A', '3', '3', '1'], 'City': ['C1', 'C2', 'C3', 'C3', 'C1'],
'Munic': ['M1', 'M2', 'M3', 'M8', 'M1']}
data2 = {'UUID': ['U1', 'U2', 'U3'], 'Street': ['S1', 'S2', 'S3'],
'Number': ['1', '2A', '3'], 'City': ['C1', 'C2', 'C3'],
'Munic': ['M1', 'M2', 'M3']}
df1 = pd.DataFrame(data1)
lkp = pd.DataFrame(data2)
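As a variant, merge also accepts indicator=True, which adds a _merge column marking each row as 'both' or 'left_only'. That avoids relying on NaN in the UUID column to detect misses (which could misfire if lkp itself ever contained missing UUIDs). A sketch using the same example data; the column list and the 'nomatch.csv' filename are just illustrative choices:

```python
import pandas as pd

# Same example frames as above
df1 = pd.DataFrame({
    'UUID': [float('nan')] * 5,
    'Street': ['S1', 'S2', 'S3', 'S3', 'S1'],
    'Number': ['1', '2A', '3', '3', '1'],
    'City': ['C1', 'C2', 'C3', 'C3', 'C1'],
    'Munic': ['M1', 'M2', 'M3', 'M8', 'M1'],
})
lkp = pd.DataFrame({
    'UUID': ['U1', 'U2', 'U3'],
    'Street': ['S1', 'S2', 'S3'],
    'Number': ['1', '2A', '3'],
    'City': ['C1', 'C2', 'C3'],
    'Munic': ['M1', 'M2', 'M3'],
})

keys = ['Street', 'Number', 'City', 'Munic']

# Left join on the key columns; _merge records whether each df1 row found a partner
tmp = df1[keys].merge(lkp, on=keys, how='left', indicator=True)

out = tmp.loc[tmp['_merge'] == 'both', ['UUID'] + keys]      # matched rows with UUID
nomatch = tmp.loc[tmp['_merge'] == 'left_only', keys]        # unmatched rows

# The question also wants the misses in a file; filename is just an example:
# nomatch.to_csv('nomatch.csv', index=False)
```

This produces the same out and nomatch frames as the NaN check above, but the match status is explicit rather than inferred.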