Efficient and fast way to search through dict of dicts

Question:

So I have a dict of working jobs each holding a dict

{
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "office drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}

And I have roughly 21,000 more unique jobs to handle.

How would I go about scanning through them faster?

And is there any type of data structure that makes this faster and better to scan through? Such as a lookup table for each of the tags?

I’m using Python 3.10.4

NOTE: If it helps, everything is loaded up at the start of runtime and doesn’t change during runtime at all

Here’s my current code:

test_data = {
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "shop_owner": {"crime": "high", "morals": "high"},
    "office_drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}

class NULL: pass

class Conditional(object):
    def __init__(self, data):
        self.dataset = data
        
    def find(self, *target, **tags):
        dataset = self.dataset.items()
   
        if target:
            dataset = (
                (entry, data) for entry, data in dataset
                if all( (t in data) for t in target)
                )

        if tags:
            return [
                entry for entry, data in dataset
                if all(
                    (data.get(tag, NULL) == val) for tag, val in tags.items()
                    )
                ]
        else:
            return [data[0] for data in dataset]

jobs = Conditional(test_data)

print(jobs.find(work_drive="high"))
# ['office_drone', 'farmer']
print(jobs.find("crime"))
# ['hacker', 'mugger', 'shop_owner']
print(jobs.find("crime", "morals"))
# ['mugger', 'shop_owner']
print(jobs.find("crime", morals="high"))
# ['shop_owner']

Asked By: Dimsey

||

Answers:

To look up a first-level key in the dictionary, use either my_dict[key] or my_dict.get(key). They return the same value when the key exists; when it is missing, my_dict[key] raises KeyError while my_dict.get(key) returns None. So I think you just want to do that with your target lookup.
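
As a small sketch of the two lookup forms and where they differ, assuming a cut-down copy of the question's data (the "pirate" key is a made-up example of a job that is not in the dict):

```python
test_data = {
    "hacker": {"crime": "high"},
    "farmer": {"work_drive": "high"},
}

# Indexing returns the inner dict when the key exists.
hacker_tags = test_data["hacker"]

# .get() returns None (or a supplied default) when the key is missing...
missing = test_data.get("pirate")
missing_default = test_data.get("pirate", {})

# ...whereas indexing a missing key raises KeyError.
try:
    test_data["pirate"]
    raised = False
except KeyError:
    raised = True

print(hacker_tags, missing, missing_default, raised)
```

Both forms are average-case constant-time hash lookups, so finding one job by name does not require scanning the other jobs.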

Then, if you want to look up which jobs have a given tag at all, I think making a lookup dictionary for that is reasonable. You could make a dictionary where each tag maps to a list of the jobs that have it.

The code below runs once at the beginning and builds the lookup from test_data. It loops through the entire dictionary and, any time it encounters a tag in the values for an item, adds that item's key to the list of jobs for that tag:

lookup = dict()
for job, tags in test_data.items():
    for tag in tags:
        try:
            lookup[tag].append(job)
        except KeyError:
            # First time this tag is seen: start its list of jobs.
            lookup[tag] = [job]

Output (lookup):

{'crime': ['hacker', 'mugger', 'shop_owner'],
 'morals': ['mugger', 'shop_owner'],
 'work_drive': ['office_drone', 'farmer'],
 'tolerance': ['office_drone']}

With this lookup table, you could ask ‘Which jobs have a crime stat?’ with lookup['crime'], which would output ['hacker', 'mugger', 'shop_owner']
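
The same table can also answer "which jobs have all of these tags?". Below is a minimal sketch, assuming the test_data from the question; jobs_with_all_tags is a hypothetical helper name, and dict.setdefault is swapped in for the try/except as an equivalent way to build the lists:

```python
test_data = {
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "shop_owner": {"crime": "high", "morals": "high"},
    "office_drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}

# Build the tag -> [jobs] table once at startup.
lookup = {}
for job, tags in test_data.items():
    for tag in tags:
        lookup.setdefault(tag, []).append(job)


def jobs_with_all_tags(*tags):
    """Return the jobs that carry every given tag, in the first tag's order."""
    if not tags:
        return list(test_data)
    first, *rest = tags
    # Sets make the membership checks for the remaining tags cheap.
    rest_sets = [set(lookup.get(tag, ())) for tag in rest]
    return [
        job for job in lookup.get(first, ())
        if all(job in jobs for jobs in rest_sets)
    ]


print(jobs_with_all_tags("crime"))            # ['hacker', 'mugger', 'shop_owner']
print(jobs_with_all_tags("crime", "morals"))  # ['mugger', 'shop_owner']
```

This only covers "has the tag" queries; matching on a tag's value would still need a check against test_data or a second table keyed by value.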

Answered By: scotscotmcc

And is there any type of data structure that makes this faster and better to scan through?

Yes. And it is called dict =)

Just turn your dict into two dictionaries, one keyed by tag and another keyed by tag and tag value, each containing sets of jobs:

from collections import defaultdict

... 

by_tag = defaultdict(set)
by_tag_value = defaultdict(lambda: defaultdict(set))

for job, tags in test_data.items():
    for tag, val in tags.items():
        by_tag[tag].add(job)
        by_tag_value[tag][val].add(job)

# Example: find jobs with crime == "high" that also have a "morals" tag

crime_high = by_tag_value["crime"]["high"]
morals = by_tag["morals"]
result = crime_high.intersection(morals) # {'mugger', 'shop_owner'}

Then use them to fetch the sets you need and return the jobs that are present in all of them.
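
One way to wrap the two indexes in a find() with the same call signature as the question's Conditional.find is sketched below. It assumes the question's test_data and is not a definitive implementation; the find function here is a hypothetical stand-in, and it returns a set, so the result order is not guaranteed:

```python
from collections import defaultdict

test_data = {
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "shop_owner": {"crime": "high", "morals": "high"},
    "office_drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}

# Build both indexes once at startup.
by_tag = defaultdict(set)
by_tag_value = defaultdict(lambda: defaultdict(set))
for job, tags in test_data.items():
    for tag, val in tags.items():
        by_tag[tag].add(job)
        by_tag_value[tag][val].add(job)


def find(*target, **tags):
    """Return the set of jobs that have every tag in `target` and match every tag=value pair."""
    # .get() avoids creating empty entries in the defaultdicts for unknown tags.
    candidates = [by_tag.get(tag, set()) for tag in target]
    candidates += [
        by_tag_value.get(tag, {}).get(val, set()) for tag, val in tags.items()
    ]
    if not candidates:
        return set(test_data)
    # Put the smallest set first so the intersection starts from the fewest jobs.
    candidates.sort(key=len)
    return set.intersection(*candidates)


print(find(work_drive="high"))
print(find("crime"))
print(find("crime", "morals"))
print(find("crime", morals="high"))
```

If the original list order matters, the result can be sorted afterwards, for example by each job's position in test_data.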

Answered By: Guru Stron