RegEx for extracting domains and subdomains

Question:

I’m trying to strip a bunch of website URLs down to their domain names, e.g.:

https://www.facebook.org/hello 

becomes facebook.org.

I’m using the regex pattern finder:

(https?://)?([wW]{3}\.)?([\w]*.\w*)([/\w]*)

This catches most cases but occasionally there will be websites such as:

http://www.xxxx.wordpress.com/hello

which I want to strip to xxxx.wordpress.com.

How can I identify those cases while still identifying all other normal entries?

Asked By: Matt

||

Answers:

Although Robert Harvey has suggested a useful method of urllib.parse, here’s my attempt at the regex:

(?:http[s]?://)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?

As seen at regex101.com

Explanation –

First, the regex checks whether there is a https:// or http://. If so, it ignores it, but starts searching after that.

Then the regex checks for a www. – It’s important to note that this has been kept optional, so if the user enters my website is site.com, site.com will be matched.

[^/\n\r\s]+\.[^/\n\r\s]+ matches the actual URL you need, so it won’t contain slashes, spaces or newlines. It must also contain at least one period (.).

Since your question looks like you want to match the sub-directory as well, I’ve added (\w+)? at the end.

TL;DR

Group 0 – Entire url

Group 1 – The domain name

Group 2 – The sub-directory
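As a rough Python sketch of the groups listed above, the pattern can be compiled and compared with the urllib.parse route mentioned at the top of this answer. The function names and the "//" prefix trick for scheme-less input are illustrative assumptions, not part of the original answer.

```python
import re
from urllib.parse import urlsplit

# The pattern from this answer, with its backslash escapes written out.
URL_RE = re.compile(
    r"(?:http[s]?://)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?"
)


def domain_via_regex(text):
    """Return (group 1, group 2) of the first match, or None if nothing matches."""
    match = URL_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def domain_via_urlsplit(url):
    """Standard-library alternative: parse the URL, then drop a leading 'www.'."""
    # urlsplit only recognises the host when a scheme or '//' is present,
    # so bare input such as 'site.com' gets a '//' prefix first (an assumption
    # about how scheme-less input should be treated).
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


if __name__ == "__main__":
    for sample in ("https://www.facebook.org/hello",
                   "http://www.xxxx.wordpress.com/hello",
                   "my website is site.com"):
        print(sample, "->", domain_via_regex(sample))
    print(domain_via_urlsplit("http://www.xxxx.wordpress.com/hello"))
```

The regex version can pull a domain out of free text, while the urlsplit version expects the whole string to be a URL.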

Answered By: Robo Mop

Your expression seems to work fine and outputs what you want. I only added an i flag and slightly modified it to:

(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)
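To see what the i flag changes, here is a small Python sketch that runs the pattern with and without re.IGNORECASE. The mixed-case URL is made up for illustration.

```python
import re

PATTERN = r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)"
url = "HTTP://WWW.Example.COM/Path"  # hypothetical mixed-case input

# With the flag, the scheme and the "WWW." prefix are recognised.
with_flag = re.match(PATTERN, url, flags=re.IGNORECASE)
# Without it, both optional groups are skipped and group 3 starts at "HTTP".
without_flag = re.match(PATTERN, url)

print(with_flag.group(3))
print(without_flag.group(3))
```

With the flag, group 3 is Example.COM. Without it, the optional scheme and www groups match nothing, so group 3 picks up HTTP: instead.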

RegEx

If this wasn’t your desired expression, you can modify/change your expressions in regex101.com.


RegEx Circuit

You can also visualize your expressions in jex.im.

Python Code

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)"

test_str = ("https://www.facebook.org/hello\n"
    "http://www.xxxx.wordpress.com/hello\n"
    "http://www.xxxx.yyy.zzz.wordpress.com/hello")

# Replace each match with the text of capture group 3
subst = "\\3"

# You can manually specify the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

JavaScript Demo

const regex = /(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)/gmi;
const str = `https://www.facebook.org/hello
http://www.xxxx.wordpress.com/hello
http://www.xxxx.yyy.zzz.wordpress.com/hello`;
const subst = `$3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

Answered By: Emma
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

print("-------------")

# First pass: replace each match with the text of capture group 3
regex = r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)"
# Second pass: remove "microsoft.com" (with an optional leading dot) and the rest of the line
regex1 = r"\.?(microsoft\.com.*)"
test_str = (
    "https://blog.microsoft.com/test.html\n"
    "https://www.blog.microsoft.com/test/test\n"
    "https://microsoft.com\n"
    "http://www.blog.xyz.abc.microsoft.com/test/test\n"
    "https://www.microsoft.com")

subst = "\\3"
if test_str:
    print(test_str)

print("-----")
# You can manually specify the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")
result = re.sub(regex1, "", result, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")

Answered By: Ankita Patel

Nice post @Emma,
and thanks for that jex.im link, and the demos – I’m sure I’ll use that again.
Here’s one more with Named Capture Groups: (One Regex to rule them all)

Any Domain/Subdomain Match with MatchGroups, also handles Emails

https://stackoverflow.com/a/73653579/738895
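The linked answer is not reproduced here. As a rough illustration of what named capture groups look like in Python, here is a minimal sketch; the pattern, the group names (scheme, host, path) and the split_url helper are assumptions made for this example and are not taken from that answer.

```python
import re

# Hypothetical pattern, written in verbose mode so each part can be labelled.
URL_RE = re.compile(
    r"""
    (?:(?P<scheme>https?)://)?   # optional scheme
    (?:www\.)?                   # optional www. prefix, not captured
    (?P<host>[^/\s:]+)           # host: everything up to a slash, colon or whitespace
    (?P<path>/\S*)?              # optional path
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_url(url):
    """Return a dict of the named groups, or None if the input does not match."""
    match = URL_RE.match(url)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(split_url("http://www.xxxx.wordpress.com/hello"))
    print(split_url("facebook.org"))
```

Named groups let the calling code ask for match.group("host") instead of remembering that the domain is group 3.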

Answered By: m1m1k