Auto Mapping Using Python

Question:

I have two Excel sheets that are linked by an ID, but sometimes the IDs do not match exactly. For example, the ID in Table A could be AA00-123-334 while the ID in Table B is AA0-0123-334, A00-123-334, or AA00-123334, even though in reality they are the same. So I need to do the following:

1- in Table B I need to remove the letters (0-0123-334)

2- then I need to remove the dash – (00123334)

3- counting from the right, insert a dash after every 3 digits (00-123-334)

4- search Table A for any entry that contains (00-123-334)

5- if found, check whether the year of birth in Table B is the same as the year of the date of birth in Table A; if so, set the expected ID in Table B to the matched ID from Table A
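Steps 1 to 3 can be sketched for a single ID as follows. This is only an illustration: `normalize_id` is a hypothetical helper name, and it assumes the last two groups always hold three digits each, as in the examples above.

```python
import re

def normalize_id(raw_id):
    # Steps 1 and 2: drop everything that is not a digit (letters and dashes).
    digits = re.sub(r'[^0-9]', '', raw_id)
    # Step 3: split into groups of 3 counting from the right.
    parts = [digits[:-6], digits[-6:-3], digits[-3:]]
    # Skip empty groups so short IDs do not start with a stray dash.
    return '-'.join(p for p in parts if p)

print(normalize_id('AA0-0123-334'))  # → 00-123-334
print(normalize_id('AA00-123334'))   # → 00-123-334
print(normalize_id('123336'))        # → 123-336
```

Step 4 then amounts to a substring test, for example `normalize_id('A00-123-334') in 'AA00-123-334'`.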

Here is some example data:

import pandas as pd
import numpy as np
 
data_A = [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'], ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']]

Table_A = pd.DataFrame(data_A, columns=['ID', 'DOB'])

Table_A

data_B = [['AA-00123-334', 2011, np.nan], ['B00123-335', 2012, np.nan], ['123336', 2013, np.nan], ['00123-37', 2014, np.nan]]

Table_B = pd.DataFrame(data_B, columns=['ID', 'Year_ofbirth', 'expected_id'])

Table_B

Note that I have thousands of entries.

Asked By: Marwa


Answers:

Your problem can be solved with regular expressions and standard pandas functions:

import pandas as pd
import numpy as np
 
data_A = [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'], ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']]

Table_A = pd.DataFrame(data_A, columns=['ID', 'DOB'])

Table_A

data_B = [['AA-00123-334', 2011, np.nan], ['B00123-335', 2012, np.nan], ['123336', 2013, np.nan], ['00123-37', 2014, np.nan]]

Table_B = pd.DataFrame(data_B, columns=['ID', 'Year_ofbirth', 'expected_id'])

print(Table_B)

# strip letters and dashes from the IDs in Table B, leaving digits only
Table_B['ID'] = Table_B['ID'].str.replace('[A-Za-z-]', '', regex=True)

# add dashes back to the IDs in Table B
# (IDs with fewer than 7 digits get an empty leading group, e.g. '-123-336')
Table_B['ID'] = Table_B['ID'].str[-9:-6] + '-' + Table_B['ID'].str[-6:-3] + '-' + Table_B['ID'].str[-3:]

# initialize the expected ID column in Table B with NaN
# (object dtype, because matched IDs are stored as strings)
Table_B['expected_id'] = pd.Series(np.nan, index=Table_B.index, dtype=object)

# loop through each row in Table B
for i, row in Table_B.iterrows():
    # search Table A for IDs that contain the reformatted ID as a literal substring
    match = Table_A['ID'].str.contains(row['ID'], regex=False)
    # if a match is found, set expected ID in Table B
    if match.any():
        Table_B.at[i, 'expected_id'] = Table_A.loc[match, 'ID'].values[0]

print(Table_B)

Output:

             ID  Year_ofbirth  expected_id
0  AA-00123-334          2011          NaN
1    B00123-335          2012          NaN
2        123336          2013          NaN
3      00123-37          2014          NaN
           ID  Year_ofbirth   expected_id
0  00-123-334          2011  AA00-123-334
1  00-123-335          2012  BB00-123-335
2    -123-336          2013  CC00-123-336
3   0-012-337          2014           NaN
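
In the output above, the short IDs end up as -123-336 and 0-012-337, and the year of birth from step 5 is never compared. A possible alternative, shown here only as a sketch, is to compare digits-only keys from both tables and merge on the key together with the year. It rests on two assumptions that are not in the question: leading zeros are not significant (so 123336 and 00123336 count as the same ID), and each key and year pair is unique in Table A. The names `digits_key`, `key`, `table_a` and `table_b` are made up for this illustration.

```python
import pandas as pd

table_a = pd.DataFrame(
    [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'],
     ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']],
    columns=['ID', 'DOB'])
table_b = pd.DataFrame(
    [['AA-00123-334', 2011], ['B00123-335', 2012],
     ['123336', 2013], ['00123-37', 2014]],
    columns=['ID', 'Year_ofbirth'])

def digits_key(ids):
    # Keep digits only, then drop leading zeros (assumption: not significant).
    return ids.str.replace(r'\D', '', regex=True).str.lstrip('0')

# Lookup table built from Table A: digits-only key, birth year, original ID.
a = pd.DataFrame({
    'key': digits_key(table_a['ID']),
    'Year_ofbirth': pd.to_datetime(table_a['DOB']).dt.year.astype('int64'),
    'expected_id': table_a['ID'],
}).drop_duplicates(['key', 'Year_ofbirth'])

# Left merge keeps every row of Table B; rows without a match get NaN.
b = table_b.assign(key=digits_key(table_b['ID']))
result = b.merge(a, on=['key', 'Year_ofbirth'], how='left').drop(columns='key')
print(result)
```

With the sample data, the first three rows receive the IDs from Table A, and the last row stays NaN because its year of birth (2014) differs from the one in Table A (2015). Because the merge is vectorized, it should also cope better with thousands of entries than a row-by-row loop.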
Answered By: Caridorc