Read dict values as regex, return matches

Question:

I have a Python dictionary whose values are lists of terms:

myDict = {
    'ID_1': [r'(dog|cat[a-z+]|horse)', r'(car[a-z]+|house|apple\w)', r'(bird|tree|panda)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': [r'(dog)']
    }

I want to read the list items in each value as individual regular expressions and, if they match any text, have the matched text returned as keys in a separate dictionary, with their original keys (the IDs) as the values.

For example, suppose these terms were read as regexes and used to search this string:

"dog panda cat cats pandas car carts"

The general approach I have in mind is something like this (pseudocode):

for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[matched text] = [list of keys]

The expected output would be:

newDict = {
    'car': ['ID_1'],
    'carts': ['ID_1'],
    'dog': ['ID_1', 'ID_4'],
    'panda': ['ID_1', 'ID_2'],
    'pandas': ['ID_1', 'ID_2'],
    'cat': ['ID_1', 'ID_3'],
    'cats': ['ID_1', 'ID_3']
    }

A matched string should appear as a key in newDict only if it actually occurs in the body of text. So in the output, 'carts' is listed because a regex in ID_1's values matched it, and therefore ID_1 is listed in its value.

Asked By: Silent-J

Answers:

One way is to convert each regex into a plain list of terms, e.g. with string manipulation (this only works when the patterns are simple alternations of literal words):

In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}
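
As a rough sketch, these plain lists could then be inverted into the term-to-IDs mapping the question asks for. This assumes the patterns are simple alternations of literal words, and that a term counts as found only when it appears as a whole whitespace-separated word in the text. The names `text`, `terms_by_id`, `words`, and `term_to_ids` are illustrative and not from the question.

```python
from collections import defaultdict

myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}
text = "dog panda cat cats pandas car carts"  # sample string from the question

# Flatten each ID's patterns into plain terms (literal alternations only).
terms_by_id = {
    id_: "|".join(ls).replace("(", "").replace(")", "").split("|")
    for id_, ls in myDict.items()
}

# Invert the mapping, keeping only terms that occur as whole words in the text.
words = set(text.split())
term_to_ids = defaultdict(list)
for id_, terms in terms_by_id.items():
    for term in terms:
        if term in words and id_ not in term_to_ids[term]:
            term_to_ids[term].append(id_)

print(dict(term_to_ids))
```

Because this compares whole words rather than running the regexes, `cats`, `pandas`, and `carts` are not picked up here; catching those needs actual regex matching.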

You can turn this into a DataFrame showing which terms belong to which ID:

In [12]: import pandas as pd
    ...: from collections import Counter

In [13]: pd.DataFrame({id_:Counter( "|".join(ls).replace("(", "").replace(")", "").split("|") ) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0
Answered By: Andy Hayden

Here’s a simple script that seems to fit your requirements:

import re
from collections import defaultdict

text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""

myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
    }

newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)

for item in newDict.items():
    print(item)

Output:

('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])
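
Note that the loop above appends an ID once per match, so a word that occurs twice in the text would list the same ID twice, and a pattern like `cat` also matches inside longer words. A possible variation, sketched here against the sample string from the question, wraps each pattern in `\b` word boundaries and skips IDs that are already recorded. The `\w*` suffixes in `patterns_by_id` are an assumption about which plurals should match, and the names `patterns_by_id` and `matches` are illustrative.

```python
import re
from collections import defaultdict

text = "dog panda cat cats pandas car carts"  # sample string from the question

# Assumed patterns: the question's terms with \w* suffixes so plurals match.
patterns_by_id = {
    'ID_1': [r'(dog|cat\w*|horse)', r'(car\w*|house|apple\w*)', r'(bird|tree|panda\w*)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w*|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': [r'(dog)'],
}

matches = defaultdict(list)
for key, patterns in patterns_by_id.items():
    for pattern in patterns:
        # \b anchors restrict each match to a whole word in the text
        for match in re.finditer(r'\b' + pattern + r'\b', text):
            word = match.group(0)
            if key not in matches[word]:  # avoid listing an ID twice
                matches[word].append(key)

for item in matches.items():
    print(item)
```

With these assumed patterns, the result has the same keys and ID lists as the expected output in the question.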
Answered By: ekhumoro