Confirm if LOV columns in a pandas dataframe complies with another mapping dataframe

Question:

I have 2 dataframes, one of them of employee information by Country, and another one with a mapping of possible values for LOV columns per country (depending on the country, the column may or may not be an LOV and accept different values and is case sensitive). Example I have this data belonging to CZ:

Country Employee Name   Contract Type   Grade   Education
    CZ  Jonathan        permanent        1      Male
    CZ  Peter           np.nan           6      male
    CZ  Maro            Fixed            .      N/A
    CZ  Marisa          Contractor  01-01-2020  Female
    CZ  Petra           Permanent        5      Female

And this would be the mapping dataframe:

Country LOV Column  Values
CZ  Contract Type   Permanent
CZ  Contract Type   Fixed
US  Contract Type   Permanent
US  Contract Type   Fixed
US  Contract Type   Contractor
AE  Contract Type   Permanent
CZ  Grade           1
CZ  Grade           2
CZ  Grade           3
CZ  Grade           4
CZ  Grade           5
US  Contract Type   Manager
US  Contract Type   Non Manager
CZ  Education       1st Degree
CZ  Education       2nd Degree
SK  Education       A
SK  Education       B
SK  Education       C
AE  Gender          Male
AE  Gender          Female

I need a way to filter the LOV mapping dataframe for the Country I am analyzing, their LOV columns and possible list of values and then checking against the data which values they provided on those columns that are non compliant to the mapping table (in a list, and if an empty list this would mean all values complied to the mapping) and identify the rows the error occured (also in a list, and again if an empty list there is no error). If the cell is blank, do not consider it as an error.

Example:

Contract Type
Values non compliant in the column: ['permanent','Contractor']
Rows error occured: [0,3]

Education
Values non compliant in the column: ['Male', 'male', 'Female', 'Female']
Rows error occured: [0, 1, 3, 4]

Grade
Values non compliant in the column: ['6', '.', '01-01-2020']
Rows error occured: [1, 2, 3]

Thank you for the support!

Asked By: Paulo Cortez

||

Answers:

You can try:

df = df.replace('np.nan', np.nan)
df_mapping = df_mapping.groupby(['Country', 'LOV Column'])['Values'].agg(set)

out = {}
for idx, row in df.iterrows():
    country = row['Country']
    for i, v in zip(row.index, row):
        if pd.notna(v) and (country, i) in df_mapping and v not in df_mapping[(country, i)]:
            out.setdefault(i, []).append((idx, v))

for k, v in out.items():
    print(k)
    print('Values non compliant in the column:', [i for _, i in v])
    print('Rows error occured:', [i for i, _ in v])
    print()

Prints:

Contract Type
Values non compliant in the column: ['permanent', 'Contractor']
Rows error occured: [0, 3]

Education
Values non compliant in the column: ['Male', 'male', 'Female', 'Female']
Rows error occured: [0, 1, 3, 4]

Grade
Values non compliant in the column: ['6', '.', '01-01-2020']
Rows error occured: [1, 2, 3]
Answered By: Andrej Kesely