Auto mapping using Python
Question:
I have two Excel sheets that are linked by ID, but sometimes the IDs do not match exactly. For example, the ID in Table A could be AA00-123-334 while the ID in Table B is AA0-0123-334, A00-123-334, or AA00-123334, even though they refer to the same record. I need to do the following:
1- In Table B, remove the letters (0-0123-334)
2- Then remove the dashes (00123334)
3- Counting from the right, insert a dash after every 3 digits (00-123-334)
4- Search Table A for any entry that contains (00-123-334)
5- If found, check whether the year of birth in Table B is the same as the year of the date of birth in Table A; if so, set expected_id in Table B to the matched ID from Table A
Here is some example data:
import pandas as pd
import numpy as np
data_A = [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'], ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']]
Table_A = pd.DataFrame(data_A, columns=['ID', 'DOB'])
Table_A
data_B = [['AA-00123-334', 2011, np.nan], ['B00123-335', 2012, np.nan], ['123336', 2013, np.nan], ['00123-37', 2014, np.nan]]
Table_B = pd.DataFrame(data_B, columns=['ID', 'Year_ofbirth', 'expected_id'])
Table_B
Note that I have thousands of entries.
Answers:
Your problem can be solved with regular expressions and standard pandas functions:
import pandas as pd
import numpy as np
data_A = [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'], ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']]
Table_A = pd.DataFrame(data_A, columns=['ID', 'DOB'])
data_B = [['AA-00123-334', 2011, np.nan], ['B00123-335', 2012, np.nan], ['123336', 2013, np.nan], ['00123-37', 2014, np.nan]]
Table_B = pd.DataFrame(data_B, columns=['ID', 'Year_ofbirth', 'expected_id'])
print(Table_B)
# strip letters and dashes from the IDs in Table B
Table_B['ID'] = Table_B['ID'].str.replace('[A-Za-z-]', '', regex=True)
# add dashes back to the IDs in Table B
Table_B['ID'] = Table_B['ID'].str[-9:-6] + '-' + Table_B['ID'].str[-6:-3] + '-' + Table_B['ID'].str[-3:]
# initialize the expected ID column in Table B with NaN
# (object dtype, so that string IDs can be stored in it later)
Table_B['expected_id'] = pd.Series([np.nan] * len(Table_B), dtype=object)
# loop through each row in Table B
for i, row in Table_B.iterrows():
    # search for matching IDs in Table A
    match = Table_A['ID'].str.contains(row['ID'])
    # if a match is found, set expected ID in Table B
    if match.any():
        Table_B.at[i, 'expected_id'] = Table_A.loc[match, 'ID'].values[0]
print(Table_B)
Output:
             ID  Year_ofbirth  expected_id
0  AA-00123-334          2011          NaN
1    B00123-335          2012          NaN
2        123336          2013          NaN
3      00123-37          2014          NaN
           ID  Year_ofbirth   expected_id
0  00-123-334          2011  AA00-123-334
1  00-123-335          2012  BB00-123-335
2    -123-336          2013  CC00-123-336
3   0-012-337          2014           NaN
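Note that the loop above does not apply step 5 (the year-of-birth check), and the fixed slices produce IDs such as -123-336 and 0-012-337 when an ID has fewer than eight digits. The following is a minimal sketch of one possible variant, not a definitive implementation. It assumes that comparing only the digits of each ID is acceptable, that a Table B ID matches when its digits are a suffix of the digits of a Table A ID, and that DOB can be parsed by pd.to_datetime. The names a_digits, a_year, and b_digits are illustrative and do not appear in the original code.

```python
import pandas as pd

# Sample data from the question.
Table_A = pd.DataFrame(
    [['AA00-123-334', '2011-10-10'], ['BB00-123-335', '2012-10-10'],
     ['CC00-123-336', '2013-10-10'], ['DD00-123-37', '2015-10-10']],
    columns=['ID', 'DOB'])
Table_B = pd.DataFrame(
    [['AA-00123-334', 2011], ['B00123-335', 2012],
     ['123336', 2013], ['00123-37', 2014]],
    columns=['ID', 'Year_ofbirth'])

# Assumption: only the digits matter, so drop every non-digit character.
a_digits = Table_A['ID'].str.replace(r'\D', '', regex=True)
b_digits = Table_B['ID'].str.replace(r'\D', '', regex=True)

# Year of birth taken from the date of birth in Table A.
a_year = pd.to_datetime(Table_A['DOB']).dt.year

expected = []
for digits, year in zip(b_digits, Table_B['Year_ofbirth']):
    if not digits:
        # An empty string would be a suffix of every ID, so skip it.
        expected.append(None)
        continue
    # Steps 4 and 5: the digits must match and the years must agree.
    hit = a_digits.str.endswith(digits) & (a_year == year)
    expected.append(Table_A.loc[hit, 'ID'].iloc[0] if hit.any() else None)

Table_B['expected_id'] = expected
print(Table_B)
```

With the sample data, the first three rows receive the corresponding Table A IDs. The last row stays empty because its year of birth (2014) differs from the year in Table A (2015). If several Table A rows match, this sketch keeps the first one, as the original loop does.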