How to compare each cell of a dataframe with a list of dictionaries in Python?

Question

I am trying to compare the column values of each row of a dataframe with a predefined list of dictionaries and filter on the result. I tried a row-wise comparison in pandas, but it does not work and I get a type error. I think I may need to convert the dataframe into dictionaries, compare those with the list of dictionaries, and then convert back to a dataframe with a new column added, but this still does not give my desired output. Can anyone suggest a possible workaround? How can this be done easily in Python?

working minimal example

import pandas as pd

indf=pd.DataFrame.from_dict(indf_dict)

indf_lst=indf.to_dict(orient='records')

matches=[]
for each in rules_list:
    for row in indf_lst:
        if row in each:
            matches.append(row)

I tried a pandas approach to check the column values of every row against rules_list, but the attempt was not successful. I then converted the indf dataframe to a list of dictionaries and compared the two, but I get a type error as follows:

TypeError                                 Traceback (most recent call last)
Input In [11], in <cell line: 12>()
     12 for each in rules_list:
     13     for row in indf_lst:
---> 14         if row in each:
     15             matches.append(row)

TypeError: unhashable type: 'dict'
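
The error can be reproduced in isolation. The following is a minimal sketch, assuming `row` and `each` are both plain dicts as in the loop above (the two small dicts are made up for illustration). The expression `x in some_dict` looks `x` up among the keys of `some_dict`, which requires hashing `x`, and a dict cannot be hashed. Comparing key/value pairs avoids hashing the dict itself.

```python
# Both operands are plain dicts, mirroring `row` and `each` in the loop above.
row = {"code0": "40"}
each = {"code0": "40", "rules_desc": "rules12"}

message = None
try:
    row in each  # key lookup: Python has to hash `row` first
except TypeError as exc:
    message = str(exc)

# Compare key/value pairs instead of testing dict-in-dict membership.
shared = all(each.get(key) == value for key, value in row.items())
```

Here `message` holds the same "unhashable type" text as the traceback, while `shared` is an ordinary boolean.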

objective

I need to compare the columns of every row with the list of dictionaries rules_list and add a new column that shows whether a match was found. How can this be done in Python?

updated desired output

Here is my desired output, where I want to add two new columns when the column values match an entry in the list of dictionaries rules_list that I defined.

output={'code0':{0:('5'),1:'nan',2:('98'),3:('98'),4:'nan',5:('15'),6:('40'),7:('52'),8:('52'),9:('40'),10:('52'),11:('52'),12:('58')},'code1':{0:('Agr','Serv'),1:('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),2:('ATT','NC'),3:('ATT','VA','NC'),4:('VA','HC','NIH','ATT','COL','UCL'),5:('Agr'),6:'nan',7:('NC'),8:('NC'),9:('VA'),10:('NC'),11:('NC'),12:('CE')},'code2':{0:'nan',1:'nan',2:('103','104','105','106','31'),3:('104','105'),4:'nan',5:('5'),6:'nan',7:('109'),8:('109'),9:('11'),10:('109'),11:('109'),12:('109')},'code3':{0:('90'),1:'nan',2:('810'),3:('810'),4:'nan',5:('58'),6:('518'),7:('610','620','682','642','621','611'),8:('620','682','642','611'),9:('113','174','131','115'),10:('612','790','110'),11:('612','110'),12:('423','114')},'code4':{0:('1'),1:'nan',2:('computerscience'),3:('computerscience'),4:'nan',5:('fishing'),6:'nan',7:('biology'),8:('biology'),9:'nan',10:('biology'),11:('biology'),12:'nan'},'code5':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'nan',7:'nan',8:'nan',9:('11','19','31'),10:('12','16','18','19'),11:('12','18','19'),12:('31')},'code6':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:('594'),7:('712','479','297','639','452','172'),8:('712','479','297'),9:('164','157','388','158'),10:('285','295','236','239','269','284','237'),11:('285','295','237'),12:('372','238')},'isHit':{0:False,1:True,2:True,3:True,4:True,5:False,6:True,7:True,8:True,9:True,10:True,11:True,12:True},'rules_desc':{0:'None',1:'rules1',2:'rules2',3:'rules2',4:'rules1',5:'None',6:'rules12',7:'rules21',8:'rules21',9:'rules4',10:'rules3',11:'rules3',12:'rules5'}}

outdf=pd.DataFrame.from_dict(output)

How can I achieve this sort of mapping from each cell of the dataframe to the list of dictionaries? Should I handle this in pandas, or convert everything into lists and then compare? Any thoughts are welcome; anything close to the desired output above would be fine.

Asked By: Hamilton

Answer 1

Perhaps this will get you started. The only tricky thing here is the all function. What I’m saying here is, "for every key and value in this particular rule, if the value is found in the list of values for the corresponding key in our data row, and that’s true for EVERY part of this rule, then it is a winner".

When you have nested data like this, pandas is not the right tool. You could probably make it work, but this is way easier.

A key point here is that you need to search the VALUES in your data dictionary. Right? You have {0:'5',2:'98'...}, but we don’t care about 0 and 2. We only care about the strings.

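# Each `row` must be a column-oriented dict such as {'code0': {0: '5', ...}, ...},
# so indf_dict is assumed to be a list of such dicts. If indf_dict is a single
# dict, iterate over [indf_dict]; looping over the dict itself yields its keys.
for row in indf_dict:
    for rno,rule in enumerate(rules_list):
        print("New rule", rno)
        # A rule matches if every one of its values occurs somewhere in the
        # corresponding column; keys missing from the data (e.g. 'rules_desc') are skipped.
        match = all( val in row[key].values() for key,val in rule.items() if key in row)
        if match:
            print("Rule", rno, "matches")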

Output:

New rule 0
Rule 0 matches
New rule 1
Rule 1 matches
New rule 2
Rule 2 matches
New rule 3
New rule 4
Rule 4 matches
New rule 5
New rule 6
Rule 6 matches
New rule 7
New rule 8
Rule 8 matches
New rule 9
Rule 9 matches
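
The check above works on whole columns. A row-wise variant of the same `all(...)` idea could look like the sketch below. It rests on several assumptions that are not stated in the question: a cell satisfies a rule entry when its values are a non-empty subset of the rule's values (this is inferred from the desired output, where ('104','105') is accepted for a rule listing ('104','105','106','31')), the strings 'nan' and '' mean "empty", and the first satisfied rule wins. The helper names `as_set`, `satisfies` and `first_matching_rule` are hypothetical, and the data is a shortened copy of four rules and four rows from this thread.

```python
def as_set(value):
    """Normalize a cell or rule value to a set of strings.

    ('5') is just the string '5', so both strings and tuples occur in the data.
    """
    if isinstance(value, str):
        return set() if value in ("", "nan") else {value}
    if isinstance(value, (tuple, list, set)):
        return set(value)
    return set()  # None, float NaN or anything else counts as empty


def satisfies(row, rule, desc_key="rules_desc"):
    """True if every code in `rule` is matched by a non-empty subset in `row`."""
    def cell_ok(key, allowed):
        cell = as_set(row.get(key))
        return bool(cell) and cell <= as_set(allowed)

    return all(cell_ok(k, v) for k, v in rule.items() if k != desc_key)


def first_matching_rule(row, rules, desc_key="rules_desc"):
    """Return the description of the first rule that `row` satisfies, else None."""
    for rule in rules:
        if satisfies(row, rule, desc_key):
            return rule[desc_key]
    return None


rules = [
    {"code1": ("VA", "HC", "NIH", "SAP", "AUS", "HOL", "ATT", "COL", "UCL"),
     "rules_desc": "rules1"},
    {"code0": "40", "code3": "518", "code6": "594", "rules_desc": "rules12"},
    {"code0": "98", "code1": ("ATT", "NC"),
     "code2": ("103", "104", "105", "106", "31"), "code3": "810",
     "code4": "computerscience", "rules_desc": "rules2"},
    {"code0": "98", "code1": ("ATT", "VA", "NC"),
     "code2": ("104", "105", "106", "31"), "code4": "computerscience",
     "rules_desc": "rules2"},
]
rows = [
    {"code0": "5", "code1": ("Agr", "Serv"), "code3": "90"},
    {"code0": "nan", "code1": ("VA", "HC", "NIH", "ATT", "COL", "UCL")},
    {"code0": "40", "code1": "nan", "code3": "518", "code6": "594"},
    {"code0": "98", "code1": ("ATT", "VA", "NC"), "code2": ("104", "105"),
     "code3": "810", "code4": "computerscience"},
]

rules_desc = [first_matching_rule(row, rules) for row in rows]
is_hit = [desc is not None for desc in rules_desc]
```

With a dataframe, the rows would come from `to_dict(orient='records')` and the two lists could be assigned as new columns. Because the first satisfied rule wins, a broad rule such as rules1 also claims rows that a more specific rule further down would match, so the order of rules_list matters.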

Answered By: Tim Roberts

Answer 2

The code below should do what you are asking for, but I have not yet tested whether it really does. I have put some effort into naming the variables so that it is easier to understand what the code does and how it works.

In the first step, the code transforms the list of rule dictionaries into a list of (code, code value) tuples for each rule, so that the final loop that checks for hits is easier to put together, understand, maintain, and debug.

In the second step, the code converts the data dictionary with pandas, as is done in the code in the question.

There is probably also a pandas way to transform the list of dictionaries in the first step; if you read this and know how to do that with pandas, I would be glad to hear about it.

Maybe the entire task can be accomplished with pandas in two or three lines of code. With the variable naming and the loops provided here, it should be easier for a reader to come up with that code and perhaps provide another, better answer.

from pprint import pprint
import pandas as pd
from collections import defaultdict
# ----------------------------------------------------------------------
rules_list=rules_dict=[{'code1':('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),'rules_desc':'rules1'},{'code0':('40'),'code3':('518'),'code6':('594'),'rules_desc':'rules12'},{'code0':('98'),'code1':('ATT','NC'),'code2':('103','104','105','106','31'),'code3':('810'),'code4':('computerscience'),'rules_desc':'rules2'},{'code0':('98'),'code1':('ATT','VA','NC'),'code2':('104','105','106','31'),'code4':('computerscience'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('610','620','682','642','621','611'),'code4':('biology'),'code6':('712','479','297','639','452','172'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('396','340','394','393','240'),'code4':('biology'),'code5':('12','18'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('612','790','110'),'code4':('biology'),'code5':('12','16','18','19'),'code6':('285','295','236','239','269','284','237'),'rules_desc':'rules3'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('730','320','350','379','812','374'),'code4':('biology'),'code5':('12','18','19'),'rules_desc':'rules3'},{'code0':('40'),'code1':('VA'),'code2':('11'),'code3':('113','174','131','115'),'code5':('11','19','31'),'code6':('164','157','388','158'),'rules_desc':'rules4'},{'code0':('58'),'code1':('CE'),'code2':('109'),'code3':('423','114'),'code5':('31'),'code6':('372','238'),'rules_desc':'rules5'}]
# codeNname     : 'code0', 'code1', 'code2', ..., 'code6'
# ruleNname     : 'rules1', 'rules12', 'rules2', ..., 'rules5'
# ruleDescrKey  : 'rules_desc'
# dictRulesSpec : dictionary { codeNname:value {1,N} ... , ruleDescrKey:ruleNname }
# dictCodes     : dictionary { codeNname:value, codeNname:value, ... }
# Rules         : List [ dictRulesSpec, dictRulesSpec, ... ]
# dictRules     : { ruleNname:[(codeNname, codeNnameValue), ...], ...  }
Rules = rules_list
ruleDescrKey = 'rules_desc'
dictRules    = defaultdict(list)
for dictRulesSpec in Rules:
    ruleNname = dictRulesSpec.pop(ruleDescrKey)
    # dictRulesSpec without ruleDescrKey item has only Codes as keys, so:
    dictCodes = dictRulesSpec 
    for codeNname, codeNnameValue in dictCodes.items(): 
        dictRules[ruleNname].append( (codeNname, codeNnameValue) ) 
print(f'{Rules=}')
print(f'{dictRules=}')
print(' ------------- ')
# ----------------------------------------------------------------------
indf_dict={'code0':{0:('5'),1:'nan',2:('98'),3:('98'),4:'',5:('15'),6:('40'),7:('52'),8:('52'),9:('40'),10:('52'),11:('52'),12:('58')},'code1':{0:('Agr','Serv'),1:('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),2:('ATT','NC'),3:('ATT','VA','NC'),4:('VA','HC','NIH','ATT','COL','UCL'),5:('Agr'),6:'nan',7:('NC'),8:('NC'),9:('VA'),10:('NC'),11:('NC'),12:('CE')},'code2':{0:'nan',1:'nan',2:('103','104','105','106','31'),3:('104','105'),4:'nan',5:('5'),6:'nan',7:('109'),8:('109'),9:('11'),10:('109'),11:('109'),12:('109')},'code3':{0:('90'),1:'nan',2:('810'),3:('810'),4:'nan',5:('58'),6:('518'),7:('610','620','682','642','621','611'),8:('620','682','642','611'),9:('113','174','131','115'),10:('612','790','110'),11:('612','110'),12:('423','114')},'code4':{0:('1'),1:'nan',2:('computerscience'),3:('computerscience'),4:'nan',5:('fishing'),6:'nan',7:('biology'),8:('biology'),9:'nan',10:('biology'),11:('biology'),12:'nan'},'code5':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'nan',7:'nan',8:'nan',9:('11','19','31'),10:('12','16','18','19'),11:('12','18','19'),12:'31'},'code6':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'594',7:('712','479','297','639','452','172'),8:('712','479','297'),9:('164','157','388','158'),10:('285','295','236','239','269','284','237'),11:('285','295','237'),12:('372','238')}}
dictDataRowsByCodeNname = indf_dict
df_dictDataRowsByCodeNname = pd.DataFrame.from_dict(dictDataRowsByCodeNname)
print(f'{dictDataRowsByCodeNname=}')
listDataRowsByRow = df_dictDataRowsByCodeNname.to_dict(orient='records')
print(f'{listDataRowsByRow=}')
print(' ------------- ')
isHit_Column      = []
rules_desc_Column = []
# The loop below tests for only one hit within the rule ...
for dctDataRow in listDataRowsByRow: 
    isHit = False
    for ruleNname, listTuplesCodeNnameValue in dictRules.items():
        if isHit:
            break
        for codeNname, codeNnameValue in listTuplesCodeNnameValue:
            if isHit:
                break
            else:
                if dctDataRow[codeNname] == codeNnameValue: 
                    isHit = True
                    bckpRuleNname = ruleNname
                    break
    rules_desc_Column.append( bckpRuleNname if isHit else None)
    isHit_Column.append(isHit)

print(f'{rules_desc_Column = }')
print(f'{isHit_Column      = }') 
print('================================')
df_dictDataRowsByCodeNname['isHit']      = isHit_Column
df_dictDataRowsByCodeNname['rules_desc'] = rules_desc_Column
print(df_dictDataRowsByCodeNname)
print('================================')

isHit_Column      = []
rules_desc_Column = []
# The loop below tests for all hits within the rule and
# lists all rules that apply in case of hits: 
for dctDataRow in listDataRowsByRow: 
    lstRulesWithHits = []
    for ruleNname, listTuplesCodeNnameValue in dictRules.items():
        ruleItemsWithHits = 0
        for codeNname, codeNnameValue in listTuplesCodeNnameValue:
            if dctDataRow[codeNname] == codeNnameValue: 
                ruleItemsWithHits += 1
        if ruleItemsWithHits == len(listTuplesCodeNnameValue):
            lstRulesWithHits.append(ruleNname)
    isHit = bool(lstRulesWithHits)
    rules_desc_Column.append( lstRulesWithHits if isHit else None)
    isHit_Column.append(isHit)
df_dictDataRowsByCodeNname['isHit']      = isHit_Column
df_dictDataRowsByCodeNname['rules_desc'] = rules_desc_Column
print(df_dictDataRowsByCodeNname)
print('================================')

which gives:

Rules=[{'code1': ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL')}, {'code0': '40', 'code3': '518', 'code6': '594'}, {'code0': '98', 'code1': ('ATT', 'NC'), 'code2': ('103', '104', '105', '106', '31'), 'code3': '810', 'code4': 'computerscience'}, {'code0': '98', 'code1': ('ATT', 'VA', 'NC'), 'code2': ('104', '105', '106', '31'), 'code4': 'computerscience'}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('610', '620', '682', '642', '621', '611'), 'code4': 'biology', 'code6': ('712', '479', '297', '639', '452', '172')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('396', '340', '394', '393', '240'), 'code4': 'biology', 'code5': ('12', '18')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '790', '110'), 'code4': 'biology', 'code5': ('12', '16', '18', '19'), 'code6': ('285', '295', '236', '239', '269', '284', '237')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('730', '320', '350', '379', '812', '374'), 'code4': 'biology', 'code5': ('12', '18', '19')}, {'code0': '40', 'code1': 'VA', 'code2': '11', 'code3': ('113', '174', '131', '115'), 'code5': ('11', '19', '31'), 'code6': ('164', '157', '388', '158')}, {'code0': '58', 'code1': 'CE', 'code2': '109', 'code3': ('423', '114'), 'code5': '31', 'code6': ('372', '238')}]
dictRules=defaultdict(<class 'list'>, {'rules1': [('code1', ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'))], 'rules12': [('code0', '40'), ('code3', '518'), ('code6', '594')], 'rules2': [('code0', '98'), ('code1', ('ATT', 'NC')), ('code2', ('103', '104', '105', '106', '31')), ('code3', '810'), ('code4', 'computerscience'), ('code0', '98'), ('code1', ('ATT', 'VA', 'NC')), ('code2', ('104', '105', '106', '31')), ('code4', 'computerscience'), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('610', '620', '682', '642', '621', '611')), ('code4', 'biology'), ('code6', ('712', '479', '297', '639', '452', '172')), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('396', '340', '394', '393', '240')), ('code4', 'biology'), ('code5', ('12', '18'))], 'rules3': [('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('612', '790', '110')), ('code4', 'biology'), ('code5', ('12', '16', '18', '19')), ('code6', ('285', '295', '236', '239', '269', '284', '237')), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('730', '320', '350', '379', '812', '374')), ('code4', 'biology'), ('code5', ('12', '18', '19'))], 'rules4': [('code0', '40'), ('code1', 'VA'), ('code2', '11'), ('code3', ('113', '174', '131', '115')), ('code5', ('11', '19', '31')), ('code6', ('164', '157', '388', '158'))], 'rules5': [('code0', '58'), ('code1', 'CE'), ('code2', '109'), ('code3', ('423', '114')), ('code5', '31'), ('code6', ('372', '238'))]})
 ------------- 
dictDataRowsByCodeNname={'code0': {0: '5', 1: 'nan', 2: '98', 3: '98', 4: '', 5: '15', 6: '40', 7: '52', 8: '52', 9: '40', 10: '52', 11: '52', 12: '58'}, 'code1': {0: ('Agr', 'Serv'), 1: ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'), 2: ('ATT', 'NC'), 3: ('ATT', 'VA', 'NC'), 4: ('VA', 'HC', 'NIH', 'ATT', 'COL', 'UCL'), 5: 'Agr', 6: 'nan', 7: 'NC', 8: 'NC', 9: 'VA', 10: 'NC', 11: 'NC', 12: 'CE'}, 'code2': {0: 'nan', 1: 'nan', 2: ('103', '104', '105', '106', '31'), 3: ('104', '105'), 4: 'nan', 5: '5', 6: 'nan', 7: '109', 8: '109', 9: '11', 10: '109', 11: '109', 12: '109'}, 'code3': {0: '90', 1: 'nan', 2: '810', 3: '810', 4: 'nan', 5: '58', 6: '518', 7: ('610', '620', '682', '642', '621', '611'), 8: ('620', '682', '642', '611'), 9: ('113', '174', '131', '115'), 10: ('612', '790', '110'), 11: ('612', '110'), 12: ('423', '114')}, 'code4': {0: '1', 1: 'nan', 2: 'computerscience', 3: 'computerscience', 4: 'nan', 5: 'fishing', 6: 'nan', 7: 'biology', 8: 'biology', 9: 'nan', 10: 'biology', 11: 'biology', 12: 'nan'}, 'code5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan', 5: 'nan', 6: 'nan', 7: 'nan', 8: 'nan', 9: ('11', '19', '31'), 10: ('12', '16', '18', '19'), 11: ('12', '18', '19'), 12: '31'}, 'code6': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan', 5: 'nan', 6: '594', 7: ('712', '479', '297', '639', '452', '172'), 8: ('712', '479', '297'), 9: ('164', '157', '388', '158'), 10: ('285', '295', '236', '239', '269', '284', '237'), 11: ('285', '295', '237'), 12: ('372', '238')}}
listDataRowsByRow=[{'code0': '5', 'code1': ('Agr', 'Serv'), 'code2': 'nan', 'code3': '90', 'code4': '1', 'code5': 'nan', 'code6': 'nan'}, {'code0': 'nan', 'code1': ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'), 'code2': 'nan', 'code3': 'nan', 'code4': 'nan', 'code5': 'nan', 'code6': 'nan'}, {'code0': '98', 'code1': ('ATT', 'NC'), 'code2': ('103', '104', '105', '106', '31'), 'code3': '810', 'code4': 'computerscience', 'code5': 'nan', 'code6': 'nan'}, {'code0': '98', 'code1': ('ATT', 'VA', 'NC'), 'code2': ('104', '105'), 'code3': '810', 'code4': 'computerscience', 'code5': 'nan', 'code6': 'nan'}, {'code0': '', 'code1': ('VA', 'HC', 'NIH', 'ATT', 'COL', 'UCL'), 'code2': 'nan', 'code3': 'nan', 'code4': 'nan', 'code5': 'nan', 'code6': 'nan'}, {'code0': '15', 'code1': 'Agr', 'code2': '5', 'code3': '58', 'code4': 'fishing', 'code5': 'nan', 'code6': 'nan'}, {'code0': '40', 'code1': 'nan', 'code2': 'nan', 'code3': '518', 'code4': 'nan', 'code5': 'nan', 'code6': '594'}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('610', '620', '682', '642', '621', '611'), 'code4': 'biology', 'code5': 'nan', 'code6': ('712', '479', '297', '639', '452', '172')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('620', '682', '642', '611'), 'code4': 'biology', 'code5': 'nan', 'code6': ('712', '479', '297')}, {'code0': '40', 'code1': 'VA', 'code2': '11', 'code3': ('113', '174', '131', '115'), 'code4': 'nan', 'code5': ('11', '19', '31'), 'code6': ('164', '157', '388', '158')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '790', '110'), 'code4': 'biology', 'code5': ('12', '16', '18', '19'), 'code6': ('285', '295', '236', '239', '269', '284', '237')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '110'), 'code4': 'biology', 'code5': ('12', '18', '19'), 'code6': ('285', '295', '237')}, {'code0': '58', 'code1': 'CE', 'code2': '109', 'code3': ('423', '114'), 'code4': 'nan', 'code5': '31', 'code6': ('372', '238')}]
 ------------- 
rules_desc_Column = [None, 'rules12', 'rules3', 'rules3', None, None, 'rules2', 'rules3', 'rules3', 'rules2', 'rules3', 'rules3', 'rules3']
isHit_Column      = [False, True, True, True, False, False, True, True, True, True, True, True, True]
================================
   code0                                        code1  ...  isHit rules_desc
0      5                                  (Agr, Serv)  ...  False       None
1    nan  (VA, HC, NIH, SAP, AUS, HOL, ATT, COL, UCL)  ...   True    rules12
2     98                                    (ATT, NC)  ...   True     rules3
3     98                                (ATT, VA, NC)  ...   True     rules3
4                        (VA, HC, NIH, ATT, COL, UCL)  ...  False       None
5     15                                          Agr  ...  False       None
6     40                                          nan  ...   True     rules2
7     52                                           NC  ...   True     rules3
8     52                                           NC  ...   True     rules3
9     40                                           VA  ...   True     rules2
10    52                                           NC  ...   True     rules3
11    52                                           NC  ...   True     rules3
12    58                                           CE  ...   True     rules3

[13 rows x 9 columns]
================================
   code0                                        code1  ...  isHit rules_desc
0      5                                  (Agr, Serv)  ...  False       None
1    nan  (VA, HC, NIH, SAP, AUS, HOL, ATT, COL, UCL)  ...   True   [rules1]
2     98                                    (ATT, NC)  ...  False       None
3     98                                (ATT, VA, NC)  ...  False       None
4                        (VA, HC, NIH, ATT, COL, UCL)  ...  False       None
5     15                                          Agr  ...  False       None
6     40                                          nan  ...   True  [rules12]
7     52                                           NC  ...  False       None
8     52                                           NC  ...  False       None
9     40                                           VA  ...   True   [rules4]
10    52                                           NC  ...  False       None
11    52                                           NC  ...  False       None
12    58                                           CE  ...   True   [rules5]

[13 rows x 9 columns]
================================

P.S. The first of the two final loops in the code above does NOT accumulate hits into a list of all rules that apply. In other words, the search stops at the first rule item that gives a hit.

The second final loop tests all items of each rule and collects the rules that give hits in a list.
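
One thing to note about the second table: rows 2, 7 and 10 show no hit because `dictRules` groups conditions by rule name, so the four rules named 'rules2' (and the two named 'rules3') are concatenated into one long condition list that no single row can satisfy in full. A possible adjustment, sketched below with shortened example data modeled on this thread, keeps one condition list per rule and leaves the rule dictionaries unmodified. The names `named_rules` and `rules_hit` are hypothetical, and exact equality is kept as the matching criterion, as in the code above.

```python
DESC_KEY = "rules_desc"

# Shortened example rules; two of them share the name 'rules3'.
rules = [
    {"code0": "52", "code1": "NC", "code3": ("612", "790", "110"), DESC_KEY: "rules3"},
    {"code0": "52", "code1": "NC", "code3": ("730", "320"), DESC_KEY: "rules3"},
    {"code0": "40", "code3": "518", DESC_KEY: "rules12"},
]

# One (name, conditions) pair per rule: same-named rules stay separate,
# and the rule dicts are only read, never popped from.
named_rules = [
    (rule[DESC_KEY], [(k, v) for k, v in rule.items() if k != DESC_KEY])
    for rule in rules
]


def rules_hit(row):
    """List the names of all rules whose conditions all equal the row's cells."""
    hits = []
    for name, conditions in named_rules:
        if all(row.get(code) == value for code, value in conditions):
            if name not in hits:  # report each rule name once
                hits.append(name)
    return hits


rows = [
    {"code0": "52", "code1": "NC", "code3": ("612", "790", "110")},
    {"code0": "40", "code1": "nan", "code3": "518"},
    {"code0": "5", "code1": "Agr", "code3": "90"},
]

rules_desc_column = [rules_hit(row) or None for row in rows]
is_hit_column = [hits is not None for hits in rules_desc_column]
```

The two resulting lists can be assigned to the 'rules_desc' and 'isHit' columns in the same way as in the code above.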

Answered By: Claudio
