Regex for URL without path

Question:

I know there are many solutions, articles and libraries for this, but I couldn't find one that fits my case. I'm trying to write a regex that extracts a URL (the person's website) from a text (the signature in an email). It has to handle several cases:

  • May or may not start with http(s)://
  • May or may not contain www.
  • May have a multi-part TLD such as "test.com.cn"

Here are some examples:

www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn

I’ve come up with the following regex:

(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$

But there are two main problems with this, because the signature can contain an email address:

  1. It (wrongly) captures the TLDs of emails like this one: [email protected]
  2. It doesn't capture URLs in the middle of a line, and if I remove the $ at the end, it captures the name.surname part of the last example

For (1) I tried a negative lookbehind, adding (?<!@) to the beginning. The problem is that the match then simply starts one character later, so it captures est2.com instead of not matching at all.
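
The shifted match can be reproduced with a short Python sketch. The address name@test2.com is a hypothetical stand-in picked only to reproduce the est2.com symptom, and the pattern is the one from the question written with explicit escapes:

```python
import re

# The question's pattern with (?<!@) prepended.
PATTERN = re.compile(
    r"(?<!@)(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$"
)

# Hypothetical address, not taken from the original post.
sample = "name@test2.com"

# The lookbehind rejects the position right after "@", but the engine
# then retries one character further on, where the previous character
# is "t" rather than "@", and the rest of the pattern still matches.
match = PATTERN.search(sample)
partial = match.group(0) if match else None
print(partial)  # est2.com
```

The lookbehind only forbids a match that starts directly after the @; it does nothing to stop a match from starting in the middle of a word.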

Asked By: sagi

||

Answers:

I think you could use \b (word boundary) instead of $ (and at the beginning as well), and exclude @ in a negative lookbehind and lookahead:

(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)

Edit: also exclude the dot (and any other non-alphanumeric character likely to occur in a URL or email address) in the lookarounds, to avoid matching name.middlename in [email protected] or com.cn in [email protected]. See this answer for the list of characters.
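
As a rough check, here is how this pattern might be tried out in Python, written with explicit escapes. The helper name find_urls and the sample signature are made up for illustration; only the URLs come from the question's examples:

```python
import re

# Word boundaries plus lookarounds that reject "@", "." and "-" on
# either side of the match.
URL_PATTERN = re.compile(
    r"(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)"
)


def find_urls(text):
    """Return every substring of text matched by URL_PATTERN."""
    # finditer + group(0) gives the whole match; findall would return
    # tuples of the capture groups instead.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Hypothetical signature line mixing an email address with two URLs.
signature = "John Doe | john.doe@example.com | www.test.com | https://test.com.cn"
urls = find_urls(signature)
print(urls)  # ['www.test.com', 'https://test.com.cn']
```

Python's re module accepts this lookbehind because every alternative in it is exactly one character wide. Note that \w also matches digits and underscores, so this remains a heuristic for signatures rather than a full URL validator.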

Answered By: Tranbi