Python infinite loop in regex to match url

Question

I am trying to extract URLs from a text file, and my script gets stuck in what looks like an infinite loop:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

with open("some_text_file") as RAW:
    for line in RAW:
        RESULT = URL_PATTERN.findall(line)
        links = []
        for HTTP_TUPLES in RESULT:
            links.append(HTTP_TUPLES[0])

How can I avoid that?

PS: Yes, I know about urllib and other modules.

Asked By: Vladimir

Answer 1

Try:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

RESULT = []
with open("some_text_file") as RAW:
  map(lambda x:RESULT.extend(URL_PATTERN.findall(x)), RAW.xreadlines())

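In Python 3, remove xreadlines(), as the file object itself is an iterator, and drop the u prefix (ur'...' is a syntax error there). Also note that map() is lazy in Python 3, so the lambda never runs unless the result is consumed, for example with list(map(...)) or a plain for loop.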
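
A minimal Python 3 sketch of the same line-by-line extraction, using a plain for loop instead of map(). SIMPLE_URL and extract_urls are hypothetical stand-ins for illustration only; SIMPLE_URL is deliberately much simpler than the regex from the question, and io.StringIO stands in for the opened file.

```python
import io
import re

# Simplified stand-in pattern (an assumption, NOT the regex from the question).
SIMPLE_URL = re.compile(r'https?://\S+')


def extract_urls(lines, pattern=SIMPLE_URL):
    """Collect every match of `pattern` from an iterable of text lines."""
    found = []
    for line in lines:  # in Python 3 a file object yields its lines directly
        found.extend(pattern.findall(line))
    return found


# io.StringIO behaves like an open text file for iteration purposes.
sample = io.StringIO(
    "see http://example.com/a\n"
    "nothing here\n"
    "and https://example.org/b too\n"
)
print(extract_urls(sample))
```

Because the list is created once, outside the loop, matches from every line accumulate in it.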

Answered By: belteshazzar

Answer 2

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>'",]+|\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\))+(?:\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Try this. It should do it for you. See the demo:

https://regex101.com/r/ib6eed/1

Answered By: vks

Answer 3

I don’t address the correctness of the regex in this answer. You might want to take a look at this article on URL validation and customize it for your matching task.

Problem

Your regex includes a classic example of catastrophic backtracking, in the form (A*)*.

For example, in this portion:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+

If you throw away the second branch, you will immediately see the problem:

(?:[^\s()<>]+)+

The second branch also contains an instance of the problematic pattern:

([^\s()<>]+|(\([^\s()<>]+\)))*

degenerates to:

([^\s()<>]+)*

To demonstrate the problem you can test your regex on this non-matching string:

sdfsdf http://www/sdfsdfsdf(sdsdfsdfsdfsdfsdfsdf sfsdf(Sdfsdf)(sdfsdF)(sdfdsF)(<))sdsdfsf

Demo on regex101
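
The blow-up can also be reproduced locally. The following is a rough sketch, assuming CPython's standard re module; NESTED and FLAT are hypothetical names for a reduced version of the inner parenthesised group, not the full URL regex. Both patterns accept exactly the same strings, but NESTED keeps the + inside the starred group.

```python
import re
import time

# Reduced form of the problematic group: (A+)* between literal parentheses.
NESTED = re.compile(r'\((?:[^\s()<>]+)*\)')
# Same set of matching strings, but each character can be consumed in only one way.
FLAT = re.compile(r'\((?:[^\s()<>])*\)')


def timed_search(pattern, text):
    """Return (match_or_None, elapsed_seconds) for a single search."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


for n in (10, 14, 18):
    # No closing parenthesis, so every attempt must eventually fail.
    text = '(' + 'a' * n + ' '
    _, nested_time = timed_search(NESTED, text)
    _, flat_time = timed_search(FLAT, text)
    print(f'n={n:2d}  nested: {nested_time:.4f}s  flat: {flat_time:.6f}s')
```

On a backtracking engine the nested timing should climb steeply as n grows, because the run of n characters can be split between iterations in exponentially many ways before the engine gives up, while the flat timing stays essentially constant. Exact numbers depend on the machine and Python version, so keep n small when experimenting.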

Solution

Using the snippet above from your regex to demo:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
            ^             ^

In languages that support possessive quantifiers, making those quantifiers possessive is an option, since the two branches of the alternation are mutually exclusive.

However, Python's re module doesn't support possessive quantifiers (they were only added in Python 3.11). Instead, you can remove the quantifiers at the marked positions without affecting the result, since the repetition is already taken care of by the quantifier in the immediately enclosing group.

The final result (which takes care of the same problem in the last group):

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Demo on regex101
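
For completeness, here is a hedged Python 3 sketch of how the final pattern might be used. The find_urls helper and the second sample sentence are assumptions made for illustration; the first sample is the problematic test string from the Problem section.

```python
import re

# Final pattern from this answer, written as a Python 3 raw string.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')


def find_urls(text):
    """Return the full URL (capture group 1) of every match in `text`."""
    # The pattern has several capture groups, so findall() yields tuples.
    return [groups[0] for groups in URL_PATTERN.findall(text)]


tricky = ('sdfsdf http://www/sdfsdfsdf(sdsdfsdfsdfsdfsdfsdf '
          'sfsdf(Sdfsdf)(sdfsdF)(sdfdsF)(<))sdsdfsf')
print(find_urls(tricky))

sentence = ('Docs at https://en.wikipedia.org/wiki/Python_(programming_language), '
            'see also www.example.com/path.')
print(find_urls(sentence))
```

With the inner quantifiers removed, each character can be consumed in only one way, so the search on the tricky string should return promptly instead of hanging. Balanced parentheses inside a URL are kept, and trailing punctuation is left out of the match.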

Answered By: nhahtdh
