Regex – occurrences of a batch of keywords in a text

Question:

I’m doing keyword extraction on documents.

The inputs are:

  • thousands of documents (up to 2GB in size)
  • roughly 200k keywords, grouped by category

As of now, for every document, we search every keyword one by one, which I think is inefficient.

So I thought about compiling regexes by category of keywords using pipes:

import re

text = """
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC,
making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature,
discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero,
written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. 
"""

regexes = [
    r'(?P<Writing__book>book)',
    r'(?P<Writing__word>word)',
    r'(?P<Writing__latin>latin)',
    r'(?P<Writing__text>text)',
    r'(?P<Writing__literature>literature)',
    r'(?P<Cities__virginia>virginia)',
    r'(?P<Genre__classical>classical)',
    r'(?P<Genre__renaissance>renaissance)',
]
compiled_regex = '|'.join(regexes)
results = re.findall(
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)
for result in results:
    print(result)

This prints:

('', '', '', 'text', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', 'literature', '', '', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', '', 'Virginia', '', '')
('', '', 'Latin', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', '', '', 'literature', '', '', '')
('book', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', 'Renaissance')

What I’d like to get is a dictionary with each category__keyword and the number of occurrences, like:

{'Writing__book': 1, 'Writing__word': 2, 'Cities__virginia': 1, ...}
Asked By: Loïc


Answers:

(untested) But I would do something like:

# May need to remove other punctuation here using .replace()
# (the replacements must be applied to the string *before* splitting, since a
# list has no .replace() method; bare split() also handles newlines)
input_as_list = text.replace(",", "").replace(".", "").replace("(", "").replace('"', "").split()

# Add any desired words here
words_to_find = ["book", "word", "latin"]

# Output dict
output = {}

for word in words_to_find:
    output[word] = input_as_list.count(word)

print(output)

This will return something that looks like:

{"book": 7, "word": 5, "latin": 3}

Using Python's built-in string methods is recommended over regex, as their behavior is clearer.
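
If the chain of .replace() calls gets unwieldy, one more general option is str.translate with string.punctuation, which strips all ASCII punctuation in a single pass (a sketch, not part of the original answer, assuming ASCII punctuation is all that needs removing):

import string

# Strip all ASCII punctuation in one pass, then lower-case and split on
# whitespace so the lookups against words_to_find are case-insensitive
cleaned = text.translate(str.maketrans('', '', string.punctuation))
input_as_list = cleaned.lower().split()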

Answered By: Jeremy Savage

You could search for all words (sequences of letters) using a regex, count them with a Counter, and then use a comprehension over a dictionary of the words in each category to build your desired result:

import re
from collections import Counter

words = { 'Writing' : ['word', 'book', 'latin', 'text', 'literature'],
          'Cities' : ['virginia'],
          'Genre' : ['classical', 'renaissance']
        }
counts = Counter(map(str.lower, re.findall(r'\b[a-zA-Z]+\b', text)))
result = { f'{k}__{w}' : counts[w] for k, v in words.items() for w in v }

Output:

{
    "Writing__word": 1,
    "Writing__book": 1,
    "Writing__latin": 3,
    "Writing__text": 1,
    "Writing__literature": 2,
    "Cities__virginia": 1,
    "Genre__classical": 2,
    "Genre__renaissance": 1
}

Better yet, produce a dict of dicts of counts:

result = { k : { w : counts[w] for w in v } for k, v in words.items() }

Output:

{
    "Writing": {
        "word": 1,
        "book": 1,
        "latin": 3,
        "text": 1,
        "literature": 2,
        "fred": 0
    },
    "Cities": {
        "virginia": 1
    },
    "Genre": {
        "classical": 2,
        "renaissance": 1
    }
}
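
If per-category totals are also useful, they fall out of the nested result with one more comprehension (a small sketch building on the output above, not from the original answer):

category_totals = { k : sum(v.values()) for k, v in result.items() }
# e.g. {'Writing': 8, 'Cities': 1, 'Genre': 3}
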
Answered By: Nick

Here is a solution you can try:

import re

from collections import defaultdict

text = """..."""

regexes = ["..."]

compiled_regex = '|'.join(regexes)

results = re.finditer(  # <-- finditer returns an iterator (efficient on large data)
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)

word_counts = defaultdict(int)  # <-- Default dict to track counts

for result in results:
    for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since you have named capturing groups
        if value_:
            word_counts[key_] += 1

print(word_counts)

defaultdict(<class 'int'>, {'Writing__text': 1, 'Genre__classical': 2, 'Writing__latin': 3, 'Writing__literature': 2, 'Cities__virginia': 1, 'Writing__word': 2, 'Writing__book': 1, 'Genre__renaissance': 1})
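
Equivalently, the manual defaultdict loop can be collapsed into a collections.Counter over the names of the non-empty groups (a sketch of the same logic, not from the original answer):

from collections import Counter

word_counts = Counter(
    key_ for result in results
    for key_, value_ in result.groupdict().items()
    if value_
)
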
Answered By: sushanth

Some notes on performance. All testing was done using timeit with one iteration of the code on a dual Xeon server with 192GB RAM and SSD drives. The following functions were used (note that I’ve only included the counting code, since for large files that will vastly outweigh any reformatting cost):

def count_sushanth(text, regex):
    results = re.finditer(
        regex,
        text,
        flags=re.MULTILINE | re.IGNORECASE
    )

    word_counts = defaultdict(int)  # <-- Default dict to track counts

    for result in results:
        for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since you have named capturing groups
            if value_:
                word_counts[key_] += 1

    return word_counts

def count_nick(text):
    return Counter(re.split(r'\s*[^a-z0-9]', text.lower()))

The sample data from the question was used, but expanded 1M times (text = text * 1_000_000) to make it about 750MB.
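
For reference, the timings below could be collected with a harness along these lines (a sketch; the exact setup is not shown in the answer):

import timeit

big_text = text * 1_000_000  # expand the sample to roughly 750MB

# One iteration per function, as described above
print(timeit.timeit(lambda: count_nick(big_text), number=1))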

The results for the original code were count_sushanth : 114.35 seconds; count_nick : 68.72 seconds.

Not great. It did occur to me that my code was not as optimal as it might be, so I modified it to just find words instead:

def count_nick_new(text):
    return Counter(map(str.lower, re.findall(r'[a-zA-Z]+', text)))

This gave a bit of an improvement, to 43.51 seconds. What about the power of the word boundary (\b)?

def count_nick_new_wb(text):
    return Counter(map(str.lower, re.findall(r'\b[a-zA-Z]+\b', text)))

Now we’re talking: 0.55 seconds, an almost 100x improvement in speed. Applying the same optimisation to sushanth’s code:

sushanth_regex_new = r'\b(' + '|'.join(regexes) + r')\b'

gives 0.56 seconds and has the added benefit of preventing the keyword word from matching inside wordle and sword.
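
A quick demonstration of what the word boundaries buy you (a minimal sketch, not from the original answer):

import re

sample = 'word wordle sword'
print(re.findall(r'word', sample))      # ['word', 'word', 'word'] (substring hits)
print(re.findall(r'\bword\b', sample))  # ['word'] (whole words only)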

So what about compiling the regex?

nick_regex = r'\b[a-zA-Z]+\b'
nick_regex_comp = re.compile(nick_regex)

def count_nick_new_comp(text, compiled_regex):
    return Counter(map(str.lower, compiled_regex.findall(text)))

sushanth_regex_comp = re.compile(r'\b(' + '|'.join(regexes) + r')\b', re.MULTILINE | re.IGNORECASE)

def count_sushanth_comp(text, regex):
    results = regex.finditer(text)
    word_counts = defaultdict(int)  # <-- Default dict to track counts
    for result in results:
        for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since you have named capturing groups
            if value_:
                word_counts[key_] += 1
    return word_counts

In both cases compiling actually had minimal effect on performance. That makes sense: the module-level re functions cache recently compiled patterns, so compilation is only paid once either way, suggesting that most of the time is being spent processing the results of the find.

Since my code was spending time in lower-casing all the results, I thought I’d try lower-casing the entire text:

def count_nick_new_lower(text):
    return Counter(re.findall(r'\b[a-z]+\b', text.lower()))

This actually caused about a 2.5x slowdown, to 1.31 seconds, most likely because lower-casing materialises a full copy of the ~750MB string before matching even starts.

I also tried using an iterator in my code:

def count_nick_new_iter(text, regex):
    return Counter(map(lambda m: m.group().lower(), re.finditer(regex, text)))

This had no effect on performance; probably because the entire string could be held in memory anyway.

Final summary of results:

function              time (s)  notes
count_sushanth        114.35    matches sword and wordle to word
count_nick            68.72     only deals with single-word search terms
count_nick_new        43.51
count_nick_new_wb     0.55
count_sushanth_wb     0.56      solves the mismatching problem too!
count_nick_new_comp   0.55      no performance improvement for regex compilation
count_sushanth_comp   0.55      very minor performance improvement
count_nick_new_lower  1.31      significant penalty to lower-case the entire text
count_nick_new_iter   0.55      no change
Answered By: Nick