RegEx for extracting domains and subdomains
Question:
I’m trying to strip a bunch of websites down to their domain names, e.g. https://www.facebook.org/hello becomes facebook.org.
I’m using the regex pattern:
(https?://)?([wW]{3}\.)?([\w]*.\w*)([/\w]*)
This catches most cases, but occasionally there will be websites such as http://www.xxxx.wordpress.com/hello, which I want to strip to xxxx.wordpress.com.
How can I identify those cases while still identifying all other normal entries?
Answers:
Although Robert Harvey has suggested urllib.parse, which is a useful approach, here’s my attempt at the regex:
(?:http[s]?://)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?
As seen at regex101.com
Explanation –
First, the regex checks whether there is an https:// or http:// prefix. If so, it skips it and starts capturing after it.
Then the regex checks for a www. prefix. It’s important to note that this has been kept optional, so if the user enters "my website is site.com", site.com will be matched.
[^/\n\r\s]+\.[^/\n\r\s]+ matches the part of the URL you actually need: it cannot contain slashes, spaces or newlines, and there must be at least one period (.) in it.
Since your question looks like you want to match the subdirectory as well, I’ve added (\w+)? at the end.
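To see those pieces in action, here is a minimal Python sketch. It assumes the pattern above with its backslash escapes written out; the helper name split_url is made up for this example.

```python
import re

# The pattern from this answer, with the escapes written out (an assumption
# about the original, since the page shows it without backslashes).
pattern = re.compile(
    r"(?:http[s]?://)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?"
)

def split_url(text):
    """Return (domain, first path segment) for the first match, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

for sample in (
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
    "my website is site.com",
):
    print(sample, "->", split_url(sample))
```

The second element is None when there is nothing after the domain, because the last group is optional.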
TL;DR
Group 0 – Entire url
Group 1 – The domain name
Group 2 – The sub-directory
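For comparison, the urllib.parse route mentioned at the top of this answer might look roughly like the sketch below. The function name host_without_www is hypothetical, and the "//" prefix trick is my own assumption for handling inputs that come without a scheme.

```python
from urllib.parse import urlparse

def host_without_www(url):
    """Return the host part of url, minus a leading 'www.' if present."""
    # urlparse only fills in the network location when a scheme or a
    # leading "//" is present, so bare inputs such as "site.com/x" get
    # a "//" prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(host_without_www("https://www.facebook.org/hello"))
print(host_without_www("http://www.xxxx.wordpress.com/hello"))
print(host_without_www("site.com/hello"))
```

Note that hostname is returned in lower case, so this also normalizes inputs like WWW.Example.com.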
Your expression seems to be working perfectly fine, and it outputs what you might want. I only added an i flag and slightly modified it to:
(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)
RegEx
If this wasn’t your desired expression, you can modify/change your expressions in regex101.com.
RegEx Circuit
You can also visualize your expressions in jex.im:
Python Code
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re

regex = r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)"
test_str = ("https://www.facebook.org/hello\n"
            "http://www.xxxx.wordpress.com/hello\n"
            "http://www.xxxx.yyy.zzz.wordpress.com/hello")
# Raw string, so the backreference to group 3 is not read as an octal escape
subst = r"\3"
# You can limit the number of replacements by changing count (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
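It may not be obvious why this substitution also handles the longer subdomains, since group 3 only spans two labels. A small sketch that lists what group 3 captures on each successive match makes it visible. It assumes the regex above with its escapes written out; group3_pieces is a name made up for this example.

```python
import re

regex = re.compile(r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)", re.IGNORECASE)

def group3_pieces(url):
    """List what group 3 captures on each successive match within one URL."""
    return [m.group(3) for m in regex.finditer(url)]

for url in (
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
):
    pieces = group3_pieces(url)
    print(url, "->", pieces, "->", "".join(pieces))
```

For the wordpress URL the pattern matches twice, capturing xxxx.wordpress and then .com, and the replacement glues those captures back together. That is why a path containing a dot (such as /test.html) would leave extra text behind with this approach.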
JavaScript Demo
const regex = /(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)/gmi;
const str = `https://www.facebook.org/hello
http://www.xxxx.wordpress.com/hello
http://www.xxxx.yyy.zzz.wordpress.com/hello`;
const subst = `$3`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
print("-------------")
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re

regex = r"(https?://)?([w]{3}\.)?(\w*.\w*)([/\w]*)"
# Removes "microsoft.com" and everything after it, plus one preceding character
regex1 = r".?(microsoft.com.*)"
test_str = (
    "https://blog.microsoft.com/test.html\n"
    "https://www.blog.microsoft.com/test/test\n"
    "https://microsoft.com\n"
    "http://www.blog.xyz.abc.microsoft.com/test/test\n"
    "https://www.microsoft.com")
# Raw string, so the backreference to group 3 is not read as an octal escape
subst = r"\3"
if test_str:
    print(test_str)
print("-----")
# You can limit the number of replacements by changing count (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
print("-----")
# Second pass: strip the main domain so only the subdomain part is left
result = re.sub(regex1, "", result, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
print("-----")
Nice post @Emma,
and thanks for that jex.im link, and the demos – I’m sure I’ll use that again.
Here’s one more with Named Capture Groups: (One Regex to rule them all)
Any Domain/Subdomain Match with MatchGroups, also handles Emails
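As a rough illustration of what named capture groups look like in Python, here is a sketch. The pattern is hypothetical and written only for this example (it is not the commenter's regex), and the group names scheme, user, host and path are my own choices.

```python
import re

# Hypothetical pattern for illustration only.
URL_OR_EMAIL = re.compile(
    r"^(?:(?P<scheme>https?)://)?"       # optional http:// or https://
    r"(?:(?P<user>[^@/\s]+)@)?"          # optional user part, as in an email
    r"(?:www\.)?"                        # optional leading www.
    r"(?P<host>[^/@\s]+\.[^/@\s]+)"      # host with at least one dot
    r"(?P<path>/\S*)?$",                 # optional path
    re.IGNORECASE,
)

def parse(text):
    """Return the named groups as a dict, or None if text does not match."""
    match = URL_OR_EMAIL.match(text)
    return match.groupdict() if match else None

for sample in (
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
    "someone@mail.example.com",
):
    print(sample, "->", parse(sample))
```

Reading match.group("host") or the groupdict() result by name tends to be easier to maintain than remembering that the domain is group 3.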