Extract domain name from URL in Python

Question:

I am trying to extract the domain names from a list of URLs, just like in
https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be just about anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on..
The diversity of the domains doesn’t allow me to use a regex as shown in "how to get domain name from URL": my script will run on an enormous number of URLs from real network traffic, so the regex would have to be enormous to catch all the kinds of domains mentioned above.
Unfortunately, my web research didn’t turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you

Asked By: kobibo

||

Answers:

It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) on the URL and then extract the netloc.

From the netloc you can then extract the domain name by using split, as long as you know how many labels the suffix has.
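
A minimal sketch of this approach, using only the standard library. The helper name `host_labels` is made up for illustration. Note that urlparse only fills in the netloc when the URL has a scheme or starts with `//`, so bare hosts like the ones in the question need a `//` prefix first.

```python
from urllib.parse import urlparse

def host_labels(url):
    """Return the host name of `url` split into its dot-separated labels."""
    # urlparse treats input without a scheme as a path, so prefix bare
    # hosts with '//' to make them parse as a network location.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""   # hostname drops any port
    return host.split(".")

labels = host_labels("https://m.docs.google.com:8080/some/path")
print(labels)        # ['m', 'docs', 'google', 'com']
print(labels[-2])    # google

# Taking the second-to-last label is only right for one-part suffixes:
print(host_labels("www.mall.co.uk")[-2])   # co (not mall)
```

The last line shows why split alone is not enough for hosts such as `mall.co.uk`: something still has to decide how many trailing labels belong to the suffix.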

Answered By: Mariano Anaya

With regex, you could use something like this:

(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))

https://regex101.com/r/WQXFy6/5

Notice, you’ll have to watch out for special cases such as co.uk.
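
A rough sketch of how this pattern could be used from Python, assuming its dots are meant to be escaped so that they match a literal `.`. The names `DOMAIN_RE` and `domain_from_host` are made up for illustration, and only the two-part suffixes listed in the pattern are handled.

```python
import re

# Dots are escaped so they match a literal '.', and the two-part suffixes
# (co.uk, ac.us) are listed by hand; any others would need adding.
DOMAIN_RE = re.compile(r"(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))")

def domain_from_host(host):
    """Return the captured domain label, or None if the pattern finds nothing."""
    match = DOMAIN_RE.search(host)
    return match.group(1) if match else None

for host in [
    "m.google.com",
    "www.someisotericdomain.innersite.mall.co.uk",
    "www.ouruniversity.department.mit.ac.us",
    "www.example.info",
]:
    print(host, "=>", domain_from_host(host))

# The lookbehind needs a dot before the domain, so a bare host finds nothing:
print(domain_from_host("example.info"))   # None
```

For the four hosts in the loop this yields google, mall, mit and example. The lookbehind means a host with no subdomain in front of the domain is not matched at all, which is one more special case to watch for.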

Answered By: oddRaven

Use tldextract. Where urlparse only splits a URL into components such as the netloc, tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL, using the Public Suffix List.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
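
To show the idea behind suffix-based extraction without the third-party dependency, here is a rough standard-library sketch. It is not how tldextract is implemented: the `KNOWN_SUFFIXES` set and the `split_host` helper are hypothetical, and the handful of suffixes is hand-picked from the question's examples, whereas tldextract relies on the full Public Suffix List.

```python
# Hand-picked suffixes for illustration only; a real list is far longer.
KNOWN_SUFFIXES = {"com", "net", "info", "uk", "us", "co.uk", "ac.us"}

def split_host(host):
    """Split a host into (subdomain, domain, suffix) by longest known suffix."""
    labels = host.lower().split(".")
    # Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
    # Start at 1 so at least one label is left over for the domain.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""

print(split_host("m.docs.google.com"))
print(split_host("www.someisotericdomain.innersite.mall.co.uk"))
```

The second element of each tuple is the domain the question asks for (`google` and `mall` here). Matching the longest suffix first is what lets `mall.co.uk` resolve to `mall` rather than `co`.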
Answered By: akash karothiya

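Simple solution via split (no regex needed)

def domain_name(url):
    # Drop everything up to "www." and the scheme, then keep the first label.
    # Other subdomains are not stripped: "m.google.com" gives "m".
    return url.split("www.")[-1].split("//")[-1].split(".")[0]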
Answered By: Sharif O

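Check the replace and split methods.

PS: this only works for simple links like https://youtube.com (output: youtube) and www.user.ru.com (output: user).

def domain_name(url):
    # Turn "www." into a scheme marker, then keep the first label after the
    # last "//". Using [-1] also covers URLs that have both a scheme and "www.".
    return url.replace("www.", "http://").split("//")[-1].split(".")[0]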
Answered By: Denis

import re

def getDomain(url: str) -> str:
    '''
        Return the last two labels of the host name in a url,
        e.g. 'google.com' for 'https://m.docs.google.com:8080/path'.
        Multi-part suffixes are not handled: 'www.mall.co.uk' gives 'co.uk'.
    '''
    # take out the protocol (scheme), if any
    url = re.sub(r'^[a-zA-Z][a-zA-Z0-9+.-]*://', '', url)

    # take out paths, query strings and fragments
    host = re.split(r'[/?#]', url, maxsplit=1)[0]

    # take out the port number
    host = re.sub(r':[0-9]+$', '', host)

    # select only the domain (last two labels)
    return '.'.join(host.split('.')[-2:])

Answered By: Jup

from urllib.parse import urlparse
import validators

# rows: iterable of text lines whose second space-separated field is a URL or host
hostnames = []
counter = 0
errors = 0
for row_orig in rows:
    try:
        row = row_orig.strip().split(' ')[1].rstrip()
        if len(row) < 5:
            print(f"Empty row {row_orig}")
            errors += 1
            continue
        if row.startswith('http'):
            domain = urlparse(row).netloc  # works for https and http
        else:
            domain = row

        if ':' in domain:
            domain = domain.split(':')[0]  # drop the port, now that the protocol is gone

        # Finally validate it: accept either a domain name or an IPv4 address
        if not (validators.domain(domain) or validators.ipv4(domain)):
            print(f"Invalid domain/IP {domain}. RAW: {row}")
            errors += 1
            continue

        hostnames.append(domain)
        if counter % 10000 == 1:
            print(f"Added {counter}. Errors {errors}")
        counter += 1
    except Exception:
        # e.g. IndexError when the row has no second field
        print("Error in extraction")
        errors += 1
Answered By: gies0r