Regex URL whitelist lets blocked URLs through on the same line of a message

Question:

I have a regular expression that blocks URLs in a message, but I want to whitelist the site's own URL.

Currently it works with any prefix, like HTTP://www.example.com, and with paths, like www.example.com/support/how-do-i-setup-this. But if I put another URL after the whitelisted one on the same line, that second URL gets through the filter, which I don't want. It is only blocked as required if I put it on a new line.

"go to http://example.com/support/how-do-i www.badurl.com" does not block the bad URL, which is what I want to happen.

Also, the string "www.badurl.comexample.com" results in both being blocked, but ideally I would like to whitelist the example.com URL here too.

[-a-zA-Z0-9@:%_+.~#?&//=]{2,256}\.[a-z]{2,24}\b(/[-a-zA-Z0-9@:%_+.~#?&//=]*)?(?<!\bexample\.com(/.*)?)

Current Python function:

import re

def link_remover(message):
    # Remove any links that aren't in the whitelist
    message = re.sub(r"[-a-zA-Z0-9@:%_+.~#?&//=]{2,256}\.[a-z]{2,24}\b(/[-a-zA-Z0-9@:%_+.~#?&//=]*)?(?<!\bexample\.com)", "[URL Removed]", message)
    return message

So I'm just wondering how to edit it so that the two failing examples above are handled correctly.

I appreciate any responses or pointing me in the right direction 🙂

Asked By: AlphaDjango

||

Answers:

Final Update:
Added a negative lookbehind that acts as a boundary at the start of every whitelist check.
Example: (?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)
The negated class ensures that only an optional subdomain can come before the whitelisted domain,
because the only characters the rest of the regex allows directly in front of it are a dot or a forward slash.
Therefore no letters can bleed into it from the left, as in wrongexample.com.
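
To see the boundary on its own, here is a minimal sketch. The names BOUNDARY_CHECK and is_whitelisted are made up for illustration, and the domain part is a simplified stand-in, not the full URL regex from this answer.

```python
import re

# Simplified stand-in for the domain matcher (an assumption for this sketch),
# followed by the whitelist lookbehind with its boundary.
BOUNDARY_CHECK = re.compile(
    r"[a-z0-9.-]+\.[a-z]{2,}"
    r"(?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)"
)

def is_whitelisted(host):
    # True only if the whole host matches and ends in a cleanly bounded example.com
    return BOUNDARY_CHECK.fullmatch(host) is not None

for host in ("example.com", "www.example.com", "wrongexample.com", "bad-example.com"):
    print(host, is_whitelisted(host))
```

The first two hosts pass because nothing, or only a dot, sits in front of example.com. The last two fail because a letter or a hyphen is adjacent to it.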

This is an example where the whitelist items are optionally matched.
Every URL is matched. The whitelist check is placed right after the domain
is matched, so the match still encompasses any trailing optional port or directories.

A lambda callback is all that’s needed to check if any of the whitelist urls matched.
If so, just write them back unchanged.
If none matched then write back the Removed string.

Modified logic:
Changed to need only one capture group.
The group is used as a flag.

If the group is None, no whitelist item was found for the match.
The callback returns {Removed}, which overwrites the bad URL.

Otherwise a whitelist item was found and the match is returned unchanged:
return m.group(0).
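
The flag logic can be tried in isolation with a much shorter pattern. This is only a sketch: PATTERN and filter_links are hypothetical names, and the URL matcher is deliberately simplified compared with the full regex in this answer.

```python
import re

# Simplified stand-in pattern (an assumption for this sketch):
# optional scheme, domain, optional whitelist flag group, optional path.
PATTERN = re.compile(
    r"(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}"
    r"((?<=(?<![-a-z0-9])example\.com))?"   # group 1: set only for whitelisted domains
    r"(?:/\S*)?"
)

def repl(m):
    # Group 1 is None when no whitelist item matched: overwrite the URL.
    if m.group(1) is None:
        return "{Removed}"
    # Otherwise group 1 is an empty string: write the match back unchanged.
    return m.group(0)

def filter_links(text):
    return PATTERN.sub(repl, text)

print(filter_links("go to http://example.com/support/how-do-i www.badurl.com"))
```

Note that the group only ever captures an empty string, so the test has to be `is None` rather than a truthiness check.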

Notes:
All URLs are matched. Single capture group. Unlimited number of whitelist items.
Follow the template below to add whitelist items.

(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))((?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example1\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example2\.com))?))|localhost)(?::\d{2,5})?(?:/[^\s]*)?

https://regex101.com/r/rCBd0P/1
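
If the whitelist is long, the capture group could also be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named whitelist_group; it only builds the group, which you would then splice into the regex in place of the hand-written one.

```python
import re

def whitelist_group(domains):
    # Boundary from the answer: no letter, digit or hyphen directly before the domain.
    boundary = r"(?<![-a-z\u00a1-\uffff0-9])"
    # One fixed-width lookbehind per domain; re.escape takes care of the dots.
    checks = [r"(?<=" + boundary + re.escape(d) + r")" for d in domains]
    # Optional capture group, used as the whitelist flag.
    return "(" + "|".join(checks) + ")?"

group = whitelist_group(["example.com", "example1.com"])
print(group)

# Quick check against a simplified domain matcher (an assumption for this sketch).
pattern = re.compile(r"[a-z0-9.-]+\.[a-z]{2,}" + group)
print(pattern.fullmatch("www.example1.com").group(1) is not None)
print(pattern.fullmatch("bad.com").group(1) is not None)
```

Each alternative is its own lookbehind, so the domains can have different lengths even though Python requires every single lookbehind to be fixed width.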

Python Code Sample:

import re
 
def ConvertURL_func(input_text):
  # Callback: group 1 is None unless one of the whitelist items matched
  def repl(m):
    if m.group(1) is None:
      return "{Removed}"
    return m.group(0)
  # Match every URL; the optional group 1 flags the whitelisted ones
  input_text = re.sub(r"(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))((?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example1\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example2\.com))?))|localhost)(?::\d{2,5})?(?:/[^\s]*)?", repl, input_text)
  return input_text

# input URL strings example:
input_text = '''
bad.com
www.example.com
example.com
www.badurl.com www.badurlexample2.com
www.badurl.com example1.com
https://www.example2.com
'''

input_text = ConvertURL_func(input_text)
print(input_text)

Output:

>>> print(input_text)

{Removed}
www.example.com
example.com
{Removed} {Removed}
{Removed} example1.com
https://www.example2.com

>>>

Regex expanded:

 (?! mailto: )
 (?:
    (?: https? | ftp )
    ://
 )?
 (?:
    \S+
    (?: : \S* )?
    @
 )?
 (?:
    (?:
       (?:
          [1-9] \d?
        | 1 \d\d
        | 2 [01] \d
        | 22 [0-3]
       )
       (?:
          \.
          (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
       ){2}
       (?:
          \.
          (?:
             [1-9] \d?
           | 1 \d\d
           | 2 [0-4] \d
           | 25 [0-4]
          )
       )
     |
       (?:
          (?:
             (?: [a-z\u00a1-\uffff0-9]+ -? )*
             [a-z\u00a1-\uffff0-9]+
          )
          (?:
             \.
             (?: [a-z\u00a1-\uffff0-9]+ -? )*
             [a-z\u00a1-\uffff0-9]+
          )*
          (?:
             \.
             (?: [a-z\u00a1-\uffff]{2,} )
          )
          (                           # (1 start)
             # Start Whitelist

             (?<=
                (?<! [-a-z\u00a1-\uffff0-9] )
                example\.com
             )
           | (?<=
                (?<! [-a-z\u00a1-\uffff0-9] )
                example1\.com
             )
           | (?<=
                (?<! [-a-z\u00a1-\uffff0-9] )
                example2\.com
             )

             # Add more whitelist items
          )?                          # (1 end)
       )
    )
  | localhost
 )
 (?: : \d{2,5} )?
 (?: / [^\s]* )?
 
Answered By: sln