Removing subdomains from string domain name
Question:
So I have written a small function to remove subdomains (if any) from an input domain string:
def rm(text):
    print(text.replace(text, '.'.join(text.split('.')[-2:])), end="")
    print("\n")

if __name__ == "__main__":
    rm("me.apple.com")
    rm("not.me.apple.com")
    rm("really.not.me.apple.com")
    # problem here
    rm("bbc.co.uk")
It works fine until you have a two-part TLD (.something.something), like .co.uk or .co.in.
So my output is:
apple.com
apple.com
apple.com
--> co.uk
Where it should have been,
apple.com
apple.com
apple.com
bbc.co.uk
How do I fix the function in an elegant way, instead of checking for all possible double TLDs?
Edit: I will have to check millions of domains, if that matters. What I want is to pass a domain to my function and get back a clean, subdomain-free domain.
Answers:
You can’t. Not without querying some sort of service (DNS at a minimum) or encoding a database of answers in your function.
Why not? Because you can’t describe precisely in words what you are trying to do. For example, “me.apple.com” should resolve to “apple.com”, “me.apple.co.uk” should resolve to “apple.co.uk”, but what should “a.b.c.d.e” resolve to? There’s no way to know unless the examples are cherry-picked in a way that their content suggests (but still does not define) the right answer.
Once you come up with a textual description of the algorithm, it will be implementable.
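To make that concrete, here is a minimal sketch of one possible textual rule: "find the longest known suffix at the end of the name and keep exactly one label to its left". The KNOWN_SUFFIXES set and the registered_domain name are made up for illustration; the set is a tiny hand-picked sample, not real data, and the actual Public Suffix List also has wildcard and exception rules that this sketch ignores.

```python
# Hypothetical, hand-picked sample of suffixes for illustration only.
# A real run would load the full Public Suffix List instead.
KNOWN_SUFFIXES = {"com", "uk", "co.uk", "in", "co.in"}


def registered_domain(domain, suffixes=KNOWN_SUFFIXES):
    """Return the longest known suffix plus one label to its left.

    Returns None when no suffix matches, or when the input is itself
    a bare suffix (so there is no label left to keep).
    """
    labels = domain.lower().strip(".").split(".")
    # Candidates run from longest to shortest, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # input is a bare suffix such as "co.uk"
            return ".".join(labels[i - 1:])
    return None  # unknown suffix, e.g. "a.b.c.d.e"
```

With this rule, "a.b.c.d.e" has no answer at all unless "e" (or a longer tail) is in the suffix set, which is exactly the ambiguity described above. Set lookups are cheap, so the same loop should be fine for millions of names once the suffix data is loaded.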
You can use a “whois” service to do the heavy lifting: https://www.whois.com/whois/ – this does what you want if you’re willing to make HTTP requests.
The tldextract package should do the heavy lifting for you, based on the Public Suffix List. It isn’t bulletproof, but it should work for all reasonable use cases:
import tldextract

def rm(text):
    return tldextract.extract(text).registered_domain