heavy regex – really time consuming

Question:

I have the following regex to detect start and end script tags in the html file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>

It works, but it takes a very long time to find <script>,
even minutes or hours for long strings.

The lite version works perfectly, even for long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags such as <a>, where < and > can also appear inside attribute values.

python test:

import re
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# short input: returns quickly
pattern.search('11<script type="text/javascript"> easy>example</script>22').group()
# longer input: this is the call that takes minutes or hours
pattern.search('<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()

How can I fix it?
The inner part of the regex (after <script>) should be changed and simplified.

PS 🙂 I anticipate answers saying this is the wrong approach, such as "don't use regex to parse HTML".
I know many HTML/XML parsers very well, I know what to expect from HTML that is often broken, and regex is really useful here.

comment:
Well, I need to handle things like:
each <a < document like this.border="5px;">
and the approach is to use parsers and regex together.
BeautifulSoup is only about 2k lines; it does not handle every HTML document and just extends the regexes from sgmllib.

The main reason is that I must know the exact position where every tag starts and stops, and every broken HTML document must be handled.

BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []

@Cylian:
Atomic grouping, as you know, is not available in Python's re.
So non-greedy everything, .*? until <\s*/\s*tag\s*>, is the winner at this time.

I know it is not perfect in a case like this:
re.search('<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the rejected tail in the next parsing pass.
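
A minimal sketch of that idea, assuming the whitespace-tolerant non-greedy pattern described above. The names SCRIPT_RE, html and leftover are made up for illustration, and the sample string is a simplified stand-in. It shows that the match stops at the first closing tag, that span() gives the exact offsets, and that the unmatched tail stays available for a later pass:

```python
import re

# Whitespace-tolerant, non-greedy pattern: optional whitespace after "<",
# around "/" and before ">" in the closing tag.
SCRIPT_RE = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)

html = '< script </script> tail </script>'
m = SCRIPT_RE.search(html)

matched = m.group()      # the lazy .*? stops at the first closing tag
start, end = m.span()    # exact offsets of the match in the input
leftover = html[end:]    # the rejected tail, left for the next pass
```

The second </script> is not part of the match, so whatever consumes leftover has to decide what to do with a stray closing tag.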

It's pretty obvious that parsing HTML with regex is not a fight won in a single battle.

Asked By: Sławomir Lenart

||

Answers:

I don’t know python, but I know regular expressions:

if you use the greedy/non-greedy operators you get a much simpler regex:

<script.*?>.*?</script>

This is assuming there are no nested scripts.
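
In Python this could look roughly like the sketch below, using the two test strings from the question. The variable names are only for illustration, and re.DOTALL is an assumption so that "." also crosses newlines inside the script body:

```python
import re

# Lazy quantifiers stop at the first ">" and at the first "</script>".
pattern = re.compile(r'<script.*?>.*?</script>', re.I | re.DOTALL)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'

easy_match = pattern.search(easy).group()
hard_match = pattern.search(hard).group()
```

Both searches return immediately. Note that the lazy ".*?>" ends the opening tag at the first ">", so a ">" inside an attribute value would cut the tag short.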

Answered By: ilomambo

The problem with the pattern is excessive backtracking. Atomic groups can solve this. Change your pattern to this:

<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
       ^^^                             ^^^

Explanation

<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
      Match any character that is NOT a “<” «[^<]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
      Match any character that is NOT a “<” «[^<]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
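
Python's re only gained atomic groups and possessive quantifiers in version 3.11, so the pattern above will not compile on older versions. A rough sketch that works without them: a lazy quantifier inside an atomic group always settles for one character, so "(?>[^<]+?|...)" can be written as "(?:[^<]|...)" with no inner quantifier at all. The names STEP and pattern are made up here, and applying the same repeated group to the script body is an assumption based on how the question describes the intent, not the original pattern:

```python
import re

# One step of "anything that does not start a closing </s... tag":
# a single non-"<" character, or "<" followed by something other than "/s".
STEP = r'(?:[^<]|<(?:[^/]|/[^s]))'

# There is no quantifier inside the repeated group, so each position can be
# consumed in only one way and the engine cannot backtrack exponentially.
pattern = re.compile(r'<script' + STEP + r'*>' + STEP + r'*</script>', re.I)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'

easy_match = pattern.search(easy).group()
hard_match = pattern.search(hard).group()
```

The slowdown in the original comes from "[^<]+" sitting inside a group that is itself repeated with "*": when the engine has to back up to find the ">", it tries every way of splitting the same run of characters between the two quantifiers.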
Answered By: Cylian

Use an HTML parser like beautifulsoup.

See the great answers for “Can I remove script tags with beautifulsoup?”.

If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.

I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to get right because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate… A specialized parser is easier to use, performs better and is safer.

If you are still thinking “why can’t I use regex”, read this answer pointed out in mayhewr’s comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
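
As a hedged illustration of the parser route, here is a small sketch that uses Python's standard-library html.parser instead of BeautifulSoup, only so that it runs without extra packages. The class name ScriptLocator and its attributes are invented for this example. It records where each script tag starts and ends, using getpos(), which reports a (line, column) pair rather than an absolute offset:

```python
from html.parser import HTMLParser


class ScriptLocator(HTMLParser):
    """Record the (line, column) of every <script> start and end tag."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.ends = []

    def handle_starttag(self, tag, attrs):
        # Tag names are lower-cased by the parser before this is called.
        if tag == 'script':
            self.starts.append(self.getpos())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.ends.append(self.getpos())


src = '11<script type="text/javascript"> easy>example</script>22'
locator = ScriptLocator()
locator.feed(src)
locator.close()
```

How such a parser treats badly broken markup may differ from what browsers do, so it is worth checking against the kind of input you expect.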

Answered By: Paulo Scardine