Is there a better way to capture all the regex patterns in matching with nested lists within a dictionary?

Question:

I am working on a simple text-matching task: I scraped the titles of blog posts and am trying to match each title to one of my predefined categories whenever it contains specific keywords.

So for example, the title of the blog post is

"Capture Perfect Night Shots with the Oppo Reno8 Series"

Since "oppo" is listed among my keywords, this title should be matched to my "phone" category. The categories are defined like so:

categories = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}

My dataframe would then show each title alongside its matched category.

Fortunately, I found a reference on using regex to map against nested dictionaries, but it doesn’t seem to work past the first couple of words.

This is the code I use:

import re

def put_category(cats, text):

    regex = re.compile("(%s)" % "|".join(map(re.escape, categories.keys())))

    if regex.search(text):
        ret = regex.search(text)
        return ret[0]
    else:
        return 'general'

It usually falls back to "general" as the category, even when I lowercase the title first.

I’d prefer to keep defining the keywords inside the dictionary for this matching task, rather than writing pure regex patterns and then running the result through fuzzy matching.

Asked By: Nicoconut

||

Answers:

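Your pattern is built from categories.keys(), which are the category names, so it only matches a title that contains a word such as "phone", never a keyword such as "oppo". You can instead create a reverse mapping from keywords to categories, so that you can efficiently return the corresponding category when a match is found: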

import re

# Reverse mapping: keyword -> category
mapping = {keyword: category for category, keywords in categories.items() for keyword in keywords}

def put_category(mapping, text):
    # \b anchors each keyword at word boundaries; re.I ignores case
    match = re.search(rf'\b(?:{"|".join(map(re.escape, mapping))})\b', text, re.I)
    if match:
        return mapping[match[0].lower()]
    return 'general'

print(put_category(mapping, "Capture Perfect Night Shots with the Oppo Reno8 Series"))

This outputs:

phone

Demo: https://replit.com/@blhsing/BlandAdoredParser
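
One thing to watch with a keyword alternation is ordering: the regex engine takes the first alternative that matches at a given position, so "apple" would win over "apple music" if it is listed first. The following is a minimal sketch, not part of the original answer, that sorts the keywords longest-first and compiles the pattern once. The shortened categories dictionary and the name put_category_compiled are assumptions made for illustration.

```python
import re

# Shortened, illustrative dictionary (an assumption, not the full one from the question).
categories = {
    "phone": ["apple", "oppo"],
    "subculture": ["netflix", "apple music"],
    "prepaid": ["power all", "giga"],
    "ewallet": ["gigapay"],
}

# Reverse mapping: keyword -> category.
mapping = {kw: cat for cat, kws in categories.items() for kw in kws}

# Longest keywords first, so "apple music" is tried before "apple".
keywords = sorted(mapping, key=len, reverse=True)

# Compile once and reuse for every title; \b keeps matches on word boundaries.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, keywords)), re.I)

def put_category_compiled(text):
    """Hypothetical helper: category of the first keyword found, else 'general'."""
    match = pattern.search(text)
    return mapping[match[0].lower()] if match else "general"

print(put_category_compiled("Stream with Apple Music today"))
```

Compiling once also avoids rebuilding the pattern for every title when the function is applied to a whole column.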

Answered By: blhsing

In this case, you are matching fixed keywords, not patterns, so you can do it without regular expressions.

Going back to your example:

import pandas as pd

CAT_DICT = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}

df = pd.DataFrame({"title": [
    "Capture Perfect Night Shots with the Oppo Reno8 Series",
    "Personal is Powerful: Why Apple's iOS 16 is the Smartest update"
]})

You can define this function to assign categories to each title:

def assign_cat(title: str, cat_dict: dict[str, list[str]]) -> list[str]:
    title_low = title.lower()
    categories = []
    for c, words in cat_dict.items():
        # Keep the category if any of its keywords occurs in the title
        if any(w in title_low for w in words):
            categories.append(c)
    if not categories:
        categories.append("general")
    return categories

The key part is any(w in title_low for w in words). For each keyword in a category, it checks whether the keyword is present in the lowercased title, and if ANY of them is present, the category is assigned to the title. Note that the in operator does substring matching, so a short keyword such as 'ml' also matches inside a longer word such as 'html'.

Calling assign_cat on the two titles above returns ['phone'] and ['phone', 'software'] respectively.

The advantage of this approach is that a title can have multiple categories assigned to it (see the 2nd title).
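
If plain substring checks turn out to be too loose for short keywords, the multi-category idea can be combined with word boundaries. The sketch below is one possible variant, not the code from this answer: the name assign_cat_words and the shortened cat_dict are assumptions made for illustration.

```python
import re

# Shortened, illustrative category dictionary (an assumption, not the full CAT_DICT).
cat_dict = {
    "phone": ["apple", "oppo"],
    "software": ["ios", "android"],
    "subculture": ["ml", "apple music"],
}

def assign_cat_words(title, cat_dict):
    """Hypothetical whole-word variant: returns every category with a matching keyword."""
    found = []
    for cat, words in cat_dict.items():
        # One alternation per category, anchored at word boundaries.
        pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, words))
        if re.search(pattern, title, re.I):
            found.append(cat)
    # Fall back to "general" when no category matched.
    return found or ["general"]

print(assign_cat_words("Personal is Powerful: Why Apple's iOS 16 is the Smartest update", cat_dict))
print(assign_cat_words("Learn HTML basics", cat_dict))
```

Here "ml" no longer matches inside "HTML", while a title can still receive several categories. With the df defined earlier, something like df["title"].apply(lambda t: assign_cat_words(t, CAT_DICT)) should produce one list of categories per row.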

Answered By: slymore