Replace column values using another pandas dataframe mapping

Question:

Imagine I have the following dirty data with employees' contract information across countries (df1):

ID  Country  Name     Job        Date         Grade
1   CZ       John     Office     2021-01-01   Senior
1   SK       John     .          2021-01-01   Assistant
2   AE       Peter    Carpinter  2000-05-03 
3   PE       Marcia   Cleaner    1989-11-11   ERROR!
3   FR       Marcia   Assistant  1978-01-05   High
3   FR       Marcia              1999-01-01   Senior

I need to look this up against a LOV (list of values) mapping table, where each country has its own (possibly shared) LOV entries per column. For each country, if the column appears in the LOV mapping for that country and the cell value exists in the "Values" column, the value should be replaced by the corresponding code. Otherwise, the original value should be kept.

So using this mapping (df2):

Country Field   Values      Code
US      Job     Back        BA
US      Job     Front       FR
US      Job     Office      OFF
CZ      Job     Office      CZ_OFF
CZ      Job     Field       CZ_Fil
SK      Job     All         ALL
FR      Job     Assistant   AST
AE      Job     Carpinter   CAR
AE      Job     Carpinter   CAR
CZ      Grade   Senior      S
CZ      Grade   Junior      J
SK      Grade   M1          M1
FR      Grade   Low         L
FR      Grade   Mid         M1
FR      Grade   High        H

Would result in the following dataframe:

ID  Country     Name    Job     Date        Grade
1   CZ          John    CZ_OFF  2021-01-01  S
1   SK          John    .       2021-01-01  M1
2   AE          Peter   CAR     2000-05-03  
3   PE          Marcia  Cleaner 1989-11-11  ERROR!
3   FR          Marcia  AST     1978-01-05  H
3   FR          Marcia          1999-01-01  Senior
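
For reference, the two frames above can be reconstructed as follows (a sketch; treating the blank Job and Grade cells as NaN is an assumption):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'ID': [1, 1, 2, 3, 3, 3],
    'Country': ['CZ', 'SK', 'AE', 'PE', 'FR', 'FR'],
    'Name': ['John', 'John', 'Peter', 'Marcia', 'Marcia', 'Marcia'],
    'Job': ['Office', '.', 'Carpinter', 'Cleaner', 'Assistant', np.nan],
    'Date': ['2021-01-01', '2021-01-01', '2000-05-03',
             '1989-11-11', '1978-01-05', '1999-01-01'],
    'Grade': ['Senior', 'Assistant', np.nan, 'ERROR!', 'High', 'Senior'],
})

df2 = pd.DataFrame({
    'Country': ['US', 'US', 'US', 'CZ', 'CZ', 'SK', 'FR', 'AE', 'AE',
                'CZ', 'CZ', 'SK', 'FR', 'FR', 'FR'],
    'Field': ['Job'] * 9 + ['Grade'] * 6,
    'Values': ['Back', 'Front', 'Office', 'Office', 'Field', 'All', 'Assistant',
               'Carpinter', 'Carpinter',
               'Senior', 'Junior', 'M1', 'Low', 'Mid', 'High'],
    'Code': ['BA', 'FR', 'OFF', 'CZ_OFF', 'CZ_Fil', 'ALL', 'AST', 'CAR', 'CAR',
             'S', 'J', 'M1', 'L', 'M1', 'H'],
})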

Thank you so much for the support!

Asked By: Paulo Cortez


Answers:

Using a melt + merge + pivot + combine_first chain:

cols = df1.columns.difference(df2['Field'].unique())  # identifier columns (not mapped Fields)

out = (
    df1.melt(cols, var_name='Field', value_name='Values', ignore_index=False)
       .reset_index()                 # keep the original row number as an 'index' column
       .merge(df2.drop_duplicates())  # inner join on Country/Field/Values to attach the Code
       .pivot(index=cols.union(['index']), columns='Field', values='Code')
       .reset_index(list(cols))       # back to wide form, indexed by the original row number
       .combine_first(df1)[df1.columns]  # fill everything unmatched from the original df1
)

Output:

   ID Country    Name      Job        Date      Grade
0   1      CZ    John   CZ_OFF  2021-01-01          S
1   1      SK    John        .  2021-01-01  Assistant
2   2      AE   Peter      CAR  2000-05-03        NaN
3   3      PE  Marcia  Cleaner  1989-11-11     ERROR!
4   3      FR  Marcia      AST  1978-01-05          H
5   3      FR  Marcia      NaN  1999-01-01     Senior
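
The final combine_first step is what restores everything the mapping did not cover: it fills the NaN cells of the pivoted codes from the original df1, aligning on index and columns. A toy illustration of that behaviour (not part of the original answer; the small frames below are made up):

import pandas as pd
import numpy as np

coded = pd.DataFrame({'Job': ['CZ_OFF', np.nan], 'Grade': ['S', np.nan]})
original = pd.DataFrame({'Job': ['Office', 'Cleaner'], 'Grade': ['Senior', 'ERROR!']})

# row 0 keeps the codes; row 1 falls back to 'Cleaner' and 'ERROR!'
# because coded has NaN there
result = coded.combine_first(original)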

Alternative using a loop (modifying df1 in place), with groupby to select the relevant mapping rows for each column:

g = df2.groupby('Field')

for c in df1.columns.intersection(df2['Field'].unique()):
    # mapping for this column: Values -> Code per Country
    # (drop_duplicates avoids misalignment from repeated rows such as AE/Carpinter)
    mapping = (g.get_group(c).drop(columns='Field')
                .drop_duplicates()
                .rename(columns={'Values': c}))
    # left-merge on the shared 'Country' and c columns, then keep the
    # original value wherever no Code was found
    df1[c] = df1.merge(mapping, how='left')['Code'].fillna(df1[c])
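
The loop works because merge without an on= argument joins on all shared columns (here 'Country' and the current column c), and fillna then restores the original value wherever no code was found. A toy illustration of that pattern (not from the original answer; the frames below are made up):

import pandas as pd

left = pd.DataFrame({'Country': ['CZ', 'PE'], 'Job': ['Office', 'Cleaner']})
lookup = pd.DataFrame({'Country': ['CZ'], 'Job': ['Office'], 'Code': ['CZ_OFF']})

merged = left.merge(lookup, how='left')           # joins on 'Country' and 'Job'
left['Job'] = merged['Code'].fillna(left['Job'])  # ['CZ_OFF', 'Cleaner']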
Answered By: mozway