Validate a hostname string
Question:
Following up to “Regular expression to match hostname or IP Address?” and using “Restrictions on valid host names” as a reference, what is the most readable, concise way to match/validate a hostname/FQDN (fully qualified domain name) in Python? I’ve answered with my attempt below; improvements welcome.
Answers:
Process each DNS label individually by excluding invalid characters and ensuring nonzero length.
    import re

    def isValidHostname(hostname):
        # Any character outside letters, digits and hyphen is disallowed
        disallowed = re.compile(r"[^a-zA-Z\d-]")
        return all(map(lambda x: len(x) and not disallowed.search(x), hostname.split(".")))
    import re

    def is_valid_hostname(hostname):
        if len(hostname) > 255:
            return False
        if hostname.endswith("."):
            hostname = hostname[:-1]  # strip exactly one dot from the right, if present
        allowed = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)$", re.IGNORECASE)
        return all(allowed.match(x) for x in hostname.split("."))
This ensures that each segment
- contains at least one character and a maximum of 63 characters
- consists only of allowed characters
- doesn’t begin or end with a hyphen.
It also avoids double negatives (not disallowed), and if hostname ends in a single ".", that’s OK, too. It will (and should) fail if hostname ends in more than one dot.
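For readers unfamiliar with the (?!-) and (?<!-) parts of that pattern: they are a negative lookahead and a negative lookbehind, which inspect the neighbouring character without consuming it. Here is a minimal sketch of how the label pattern behaves on a few made-up inputs. The name LABEL and the sample labels are illustrative assumptions, not part of the answer.

```python
import re

# Same per-label pattern as in the answer, with the \d escape written out.
LABEL = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)$", re.IGNORECASE)

samples = [
    "example",    # plain label
    "my-host",    # inner hyphen is fine
    "-leading",   # rejected by the lookahead (?!-)
    "trailing-",  # rejected by the lookbehind (?<!-)
    "a" * 63,     # longest allowed label
    "a" * 64,     # one character too long
    "",           # empty label, e.g. from "a..b".split(".")
    "example\n",  # trailing newline: "$" still matches before it
]

results = {s: bool(LABEL.match(s)) for s in samples}

# A stricter variant: \Z only matches at the very end of the string.
STRICT_LABEL = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)\Z", re.IGNORECASE)
strict_newline = bool(STRICT_LABEL.match("example\n"))
```

Note that "$" also matches just before a trailing newline, so the last sample passes with the original pattern; using \Z in place of "$" would reject it.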
If you’re looking to validate the name of an existing host, the best way is to try to resolve it. You’ll never write a regular expression to provide that level of validation.
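As a rough sketch of that idea, the standard library's socket.getaddrinfo can be asked to resolve the name. The helper name resolves is a made-up one, and the result depends on the local resolver configuration and network, so a False here can also mean a temporary lookup failure rather than an invalid name.

```python
import socket

def resolves(hostname):
    """Return True if the system resolver finds at least one address.

    Hypothetical helper: it checks existence via a lookup, not syntax.
    """
    try:
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name could not be
        # encoded (for example a non-ASCII name with an over-long label).
        return False
    return True
```

Because this performs a real lookup, it is slower than a syntactic check and is better suited to names that are expected to exist already.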
I like the thoroughness of Tim Pietzcker’s answer, but I prefer to offload some of the logic from regular expressions for readability. Honestly, I had to look up the meaning of those (? “extension notation” parts. Additionally, I feel the “double-negative” approach is more obvious in that it limits the responsibility of the regular expression to just finding any invalid character. I do like that re.IGNORECASE allows the regex to be shortened.
So here’s another shot; it’s longer but it reads kind of like prose. I suppose “readable” is somewhat at odds with “concise”. I believe all of the validation constraints mentioned in the thread so far are covered:
    import re

    def isValidHostname(hostname):
        if len(hostname) > 255:
            return False
        if hostname.endswith("."):  # A single trailing dot is legal
            hostname = hostname[:-1]  # strip exactly one dot from the right, if present
        disallowed = re.compile(r"[^A-Z\d-]", re.IGNORECASE)
        return all(  # Split by labels and verify individually
            (label and len(label) <= 63  # length is within proper range
             and not label.startswith("-") and not label.endswith("-")  # no bordering hyphens
             and not disallowed.search(label))  # contains only legal characters
            for label in hostname.split("."))
    def is_valid_host(host):
        '''IDN compatible domain validator'''
        # encode() returns bytes in Python 3; decode so the str pattern can match
        host = host.encode('idna').decode('ascii').lower()
        if not hasattr(is_valid_host, '_re'):
            import re
            is_valid_host._re = re.compile(r'^([0-9a-z][-\w]*[0-9a-z]\.)+[a-z0-9\-]{2,15}$')
        return bool(is_valid_host._re.match(host))
Here’s a bit stricter version of Tim Pietzcker’s answer with the following improvements:
- Limit the length of the hostname to 253 characters (after stripping the optional trailing dot).
- Limit the character set to ASCII (i.e. use [0-9] instead of \d).
- Check that the TLD is not all-numeric.
    import re

    def is_valid_hostname(hostname):
        if hostname.endswith("."):
            # strip exactly one dot from the right, if present
            hostname = hostname[:-1]
        if len(hostname) > 253:
            return False
        labels = hostname.split(".")
        # the TLD must be not all-numeric
        if re.match(r"[0-9]+$", labels[-1]):
            return False
        allowed = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)$", re.IGNORECASE)
        return all(allowed.match(label) for label in labels)
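The ASCII point is easy to check: with str patterns in Python 3, \d matches any Unicode decimal digit unless the re.ASCII flag is set. A small sketch, using Arabic-Indic digits as an arbitrary example input:

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

# \d on a str pattern accepts Unicode decimal digits by default.
matches_unicode_d = bool(re.fullmatch(r"\d+", arabic_indic))

# An explicit ASCII range does not.
matches_ascii_range = bool(re.fullmatch(r"[0-9]+", arabic_indic))

# re.ASCII restricts \d (and \w, \s) to ASCII as well.
matches_ascii_flag = bool(re.fullmatch(r"\d+", arabic_indic, re.ASCII))
```

So either spelling out [0-9] or passing re.ASCII keeps non-ASCII digits out of a label.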
Per The Old New Thing, the maximum length of a DNS name is 253 characters. (One is allowed up to 255 octets, but 2 of those are consumed by the encoding.)
    import re

    def validate_fqdn(dn):
        if dn.endswith('.'):
            dn = dn[:-1]
        if len(dn) < 1 or len(dn) > 253:
            return False
        ldh_re = re.compile('^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$',
                            re.IGNORECASE)
        return all(ldh_re.match(x) for x in dn.split('.'))
One could argue for accepting empty domain names, or not, depending on one’s purpose.
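The two octets mentioned above come from the DNS wire format: every label is preceded by a one-byte length, and the name is terminated by a zero-length root label. A minimal sketch of that arithmetic, assuming ASCII labels (the helper name wire_length is made up for illustration):

```python
def wire_length(name):
    """Octets needed for `name` in DNS wire format (hypothetical helper).

    Each label costs one length byte plus its characters; the trailing
    zero byte is the root label that ends the name.
    """
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".")
    return sum(1 + len(label) for label in labels) + 1

# "example.com" is 11 characters and needs 13 octets on the wire.
short = wire_length("example.com")

# A 253-character name built from labels of 63, 63, 63 and 61 characters
# fills the 255-octet limit exactly.
longest = ".".join(["a" * 63, "b" * 63, "c" * 63, "d" * 61])
longest_octets = wire_length(longest)
```

In other words, for a dotted name without the trailing dot, the wire length is always the character count plus two.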
Complementary to @TimPietzcker’s answer.
Underscore is a valid hostname character (but not for a domain name), and a double dash is commonly found in IDN punycode domains (e.g. xn--). The port number should be stripped. Here is a cleaned-up version of the code:
    import re

    def is_valid_hostname(hostname):
        if len(hostname) > 255:
            return False
        hostname = hostname.rstrip(".")
        # The hyphen goes last in the class so it is read literally
        allowed = re.compile(r"(?!-)[A-Z\d_-]{1,63}(?<!-)$", re.IGNORECASE)
        return all(allowed.match(x) for x in hostname.split("."))

    hostname = "example.com:8080"  # example input (placeholder)
    # convert your unicode hostname to punycode (Python 3)
    # Remove the port number from hostname
    normalise_host = hostname.encode("idna").decode().split(":")[0]
    is_valid_hostname(normalise_host)
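Splitting on ":" works for simple host:port strings but not for bracketed IPv6 literals such as "[::1]:53". One possible alternative, sketched here with the standard library's urllib.parse.urlsplit, separates the port first and then applies the built-in idna codec (which implements the older IDNA 2003 rules). The helper name host_and_port and the sample inputs are assumptions for illustration.

```python
from urllib.parse import urlsplit

def host_and_port(netloc):
    """Return (ascii_host, port) for a 'host[:port]' string.

    Hypothetical helper: prefixing "//" makes urlsplit treat the input
    as a network location, so it handles the port and IPv6 brackets.
    """
    parts = urlsplit("//" + netloc)
    host = parts.hostname or ""
    # Convert non-ASCII labels to their punycode ("xn--") form.
    return host.encode("idna").decode("ascii"), parts.port

idn = host_and_port("bücher.example:8080")
plain = host_and_port("example.com")
ipv6 = host_and_port("[::1]:53")
```

The first element of each result can then be passed to whichever validator from this thread you prefer; an IPv6 literal would of course need an IP address check rather than a hostname check.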
I think this regex might help in Python:
    '^([a-zA-Z0-9]+(\.|-))*[a-zA-Z0-9]+$'
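A quick sketch of what this pattern accepts and rejects, assuming the dot inside the group is meant as a literal (escaped) dot. The sample hostnames are made up for illustration.

```python
import re

# The suggested pattern, with the dot escaped so it only matches ".".
PATTERN = re.compile(r"^([a-zA-Z0-9]+(\.|-))*[a-zA-Z0-9]+$")

samples = [
    "example.com",
    "my-host.example.com",
    "-bad.example",           # leading hyphen
    "bad-.example",           # hyphen directly before a dot
    "a..b",                   # empty label
    "xn--bcher-kva.example",  # punycode label with a double hyphen
    "a" * 70 + ".com",        # label longer than 63 characters
]

results = {s: bool(PATTERN.match(s)) for s in samples}
```

Two things stand out: the pattern has no length limits, so the 70-character label passes, and it never allows two separators in a row, so punycode labels beginning with "xn--" are rejected.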
Don’t reinvent the wheel. You can use a library, e.g. validators. Or you can copy their code:
Installation
    pip install validators
Usage
    import validators

    if validators.domain('example.com'):
        print('this domain is valid')