Matching a list of regular expressions to a list of strings

Question:

I’m working on matching a list of regular expressions against a list of strings. The problem is that the lists are very big (about 1 million regular expressions and about 50T strings). What I’ve got so far is this:

import re

reg_list = ["domain.com/picture.png", "entry{0,9}"]

y = ["test","string","entry4also_found","entry5"]

RESULT_LIST = []

for r in reg_list:
    for x in y:
        if re.findall(r, x):
            RESULT_LIST.append(x)
            print(x)

This works logically, but it is far too inefficient for that number of entries. Is there a better (more efficient) solution for this?

Asked By: plategt

Answers:

The only enhancements that come to mind are

  • Stopping at the first match: re.findall searches for every match in a string, which is more than you need here.
  • Pre-compiling your regexes.

import re

reg_list = [r"domain.com/picture.png", r"entry{0,9}"]
reg_list = [re.compile(x) for x in reg_list]            # pre-compile each pattern once

y = ["test","string","entry4also_found","entry5"]

RESULT_LIST = []
for r in reg_list:
    for x in y:
        if r.search(x):                                 # search() stops at the first match
            RESULT_LIST.append(x)
            print(x)
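
To illustrate the first point, here is a minimal sketch of the difference between findall and search. The pattern entry[0-9] and the sample text are illustrative assumptions, not taken from the question.

```python
import re

# Hypothetical pattern and input, chosen only for illustration.
pattern = re.compile(r"entry[0-9]")
text = "entry1 entry2 entry3"

# findall() scans the whole string and collects every match.
all_matches = pattern.findall(text)

# search() returns a match object for the first match only (or None).
first = pattern.search(text)

print(all_matches)
print(first.group(0))
```

When the only question is whether a string matches at all, the match object returned by search() is enough and the rest of the string does not need to be scanned.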
Answered By: Ryszard Czech

Use any() to test whether any of the regular expressions match. It stops at the first expression that matches, so the rest of the list is skipped for that string.

Compile all the regular expressions first, so this doesn’t have to be done repeatedly.

reg_list = [re.compile(rx) for rx in reg_list]

for word in y:
    if any(rx.search(word) for rx in reg_list):
        RESULT_LIST.append(word)
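
For reference, a self-contained version of the snippet above might look like the following sketch. The helper name filter_matching is hypothetical; the sample data is copied from the question.

```python
import re

def filter_matching(patterns, strings):
    """Return the strings matched by at least one pattern (hypothetical helper)."""
    # Compile each pattern once, up front.
    compiled = [re.compile(p) for p in patterns]
    # any() stops checking patterns as soon as one matches a string.
    return [s for s in strings if any(rx.search(s) for rx in compiled)]

reg_list = [r"domain.com/picture.png", r"entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]

RESULT_LIST = filter_matching(reg_list, y)
print(RESULT_LIST)
```

Because the strings are in the outer loop here, each string is added at most once, even if several expressions match it.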
Answered By: Barmar

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

So if you are going to use the same regex many times, it may be worth calling re.compile once up front (especially for more complex regexes).
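
The same comparison can be run from inside Python with the timeit module. This is a rough sketch: the iteration count is an arbitrary choice, and absolute timings will vary by machine and Python version.

```python
import re
import timeit

text = "hello world"
compiled = re.compile("hello")

# Module-level function: the pattern is looked up on every call.
uncompiled_time = timeit.timeit(lambda: re.match("hello", text), number=100_000)

# Pre-compiled pattern object: no per-call pattern lookup.
compiled_time = timeit.timeit(lambda: compiled.match(text), number=100_000)

print(f"re.match:       {uncompiled_time:.4f} s")
print(f"compiled.match: {compiled_time:.4f} s")
```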
