How to validate a url in Python? (Malformed or not)
Question:
I get a URL from the user, and I have to reply with the fetched HTML.
How can I check whether the URL is malformed or not?
For example:
url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed
Answers:
django url validation regex (source):
import re
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)
print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None) # False
Actually, I think this is the best way.
from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)
If you set verify_exists to True, it will actually verify that the URL exists; otherwise it will just check whether it is formed correctly. (Note that verify_exists was removed in later versions of Django, so on a current install you would instantiate URLValidator() with no arguments.)
edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django’s validators?
note – lepl is no longer supported, sorry (you’re welcome to use it, and i think the code below works, but it’s not going to get updates).
rfc 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for http urls and email). i implemented its recommendations in python using lepl (a parser library). see http://acooke.org/lepl/rfc3696.html
to use:
> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True
Use the validators package:
>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
...
not valid
>>>
Install it from PyPI with pip (pip install validators).
I landed on this page trying to figure out a sane way to validate strings as “valid” urls. I share here my solution using python3. No extra libraries required.
See https://docs.python.org/2/library/urlparse.html if you are using python2.
See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3 as I am.
import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))
ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')
ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')
'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.
'https://stackoverflow.com' is probably a valid url.
Here is a more concise function:
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')

def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
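For instance, this is how the concise helper behaves on the question's examples (the function is repeated here so the snippet runs on its own); note that it accepts 'http://google' because a netloc is present:

```python
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')

def is_valid(url, qualifying=min_attributes):
    # valid if every qualifying attribute of the parse result is non-empty
    tokens = urlparse(url)
    return all(getattr(tokens, attr) for attr in qualifying)

print(is_valid('http://google.com'))  # True
print(is_valid('google.com'))         # False: no scheme
print(is_valid('http://google'))      # True, even though the question calls it malformed
```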
A True or False version, based on @DMfll's answer:
try:
    # python2
    from urlparse import urlparse
except ImportError:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))
Gives:
True
False
False
False
True
EDIT
As pointed out by @Kwame, the code below treats the URL as valid even if the .com or .co etc. is not present.
As also pointed out by @Blaise, a URL like https://www.google passes this check; whether it actually resolves has to be verified with a separate DNS lookup.

This is simple and works. min_attr contains the basic set of attributes that need to be present for a URL to be considered valid, i.e. the http:// part and the google.com part. urlparse stores the scheme (http) in result.scheme and the domain name (google.com) in result.netloc.
from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            return True
        else:
            return False
    except Exception:
        return False
all() returns True if all the elements inside it are truthy. So if result.scheme and result.netloc are present, i.e. have some value, the URL is valid and the function returns True.
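The separate DNS check mentioned above can be sketched with the stdlib socket module; resolves() is a hypothetical helper name, and the lookup needs a working resolver:

```python
import socket
from urllib.parse import urlparse

def resolves(url):
    """Hypothetical helper: True if the URL's host actually resolves in DNS."""
    host = urlparse(url).hostname
    if not host:
        # no hostname at all (e.g. a bare path or empty string)
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

socket.getaddrinfo performs a real lookup, so this complements (rather than replaces) the purely syntactic urlparse check.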
Nowadays, I use the following, based on Padam’s answer:
$ python --version
Python 3.6.5
And this is how it looks:
from urllib.parse import urlparse

def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False
Just use is_url("http://www.asdf.com").
Hope it helps!
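Since the question treats 'http://google' as malformed while the urlparse-based check accepts it, here is a hedged sketch of a stricter variant (is_url_strict is a hypothetical name) that additionally requires a dot in the hostname, or 'localhost':

```python
from urllib.parse import urlparse

def is_url_strict(url):
    """Sketch: also reject hosts without a dot, so 'http://google' fails."""
    try:
        result = urlparse(url)
    except ValueError:
        return False
    host = result.hostname or ''
    # require a scheme, plus either a dotted hostname or 'localhost'
    return bool(result.scheme) and ('.' in host or host == 'localhost')

print(is_url_strict('http://google.com'))  # True
print(is_url_strict('http://google'))      # False
print(is_url_strict('google.com'))         # False: no scheme
```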
Validate URL with urllib and a Django-like regex
The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!
Python 3.7
import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)"  # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))"  # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+"  # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))"  # check for top level domain, no dashes allowed
    r"|localhost)"  # accept also "localhost" only
    r"(:\d{1,5})?",  # port [optional]
    re.IGNORECASE
)

SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$",  # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url
Explanation
- The code only validates the scheme and netloc part of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two according parts, which are then matched with the corresponding regex terms.)
- The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

  https://www.google.com:80/search?q=python
  ^^^^^   ^^^^^^^^^^^^^^^^^
  |       |
  |       +-- netloc (aka "domain" in my code)
  +-- scheme

- IPv4 addresses are also validated.
IPv6 Support
If you want the URL validator to also work with IPv6 addresses, do the following:
- Add is_valid_ipv6(ip) from Markus Jarderot’s answer, which has a really good IPv6 validator regex
- Add and not is_valid_ipv6(domain) to the last if statement
Examples
Here are some examples of the regex for the netloc (aka domain) part in action:
- IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
- IPv6: https://regex101.com/r/lKIIgq/1 (with the regex from Markus Jarderot’s answer)
All of the above solutions recognize a string like “http://www.google.com/path,www.yahoo.com/path” as valid. This solution avoids that and always works as it should:
import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
    u"^"
    # protocol identifier
    u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    u"(?:\S+(?::\S*)?@)?"
    u"(?:"
    u"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    u"(?:localhost)|"
    u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
    u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
    u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
    u"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    u"(?P<public_ip>"
    u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    u"" + ip_middle_octet + u"{2}"
    u"" + ip_last_octet + u")"
    u"|"
    # host name
    u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    u")"
    # port number
    u"(?::\d{2,5})?"
    # resource path
    u"(?:/\S*)?"
    # query string
    u"(?:\?\S*)?"
    u"$",
    re.UNICODE | re.IGNORECASE
)

def url_validate(url):
    """URL string validation"""
    return URL_PATTERN.match(url)
Not directly relevant, but often it’s required to identify whether some token CAN be a URL or not, not necessarily a 100% correctly formed one (i.e., with the https part omitted and so on). I’ve read this post and did not find the solution, so I am posting my own here for the sake of completeness.
def get_domain_suffixes():
    import requests
    res = requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst = set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains = line.split('.')
            cand = domains[-1]
            if cand:
                lst.add('.' + cand)
    return tuple(sorted(lst))

domain_suffixes = get_domain_suffixes()

def reminds_url(txt: str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    """
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(domain_suffixes)
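get_domain_suffixes() needs network access to download the public-suffix list; as a self-contained illustration, the same heuristic with a tiny hardcoded sample of suffixes (the real list has thousands of entries) looks like this:

```python
# Offline sketch of the same heuristic; the suffix tuple below is a
# tiny hand-picked sample, NOT the real public-suffix list.
domain_suffixes = ('.com', '.org', '.net', '.ru', '.uk')

def reminds_url(txt: str) -> bool:
    # look only at the part before the first slash (host-ish portion)
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(domain_suffixes)

print(reminds_url('yandex.ru.com/somepath'))  # True
print(reminds_url('www.example.org/page'))    # True
print(reminds_url('plain words'))             # False
```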
Here’s a regex solution, since the top-voted regex doesn’t work for weird cases like multi-part top-level domains. Some test cases are below.
import re

regex = re.compile(
    r"(\w+://)?"  # protocol (optional)
    r"(\w+\.)?"  # host (optional)
    r"((\w+)\.(\w+))"  # domain
    r"(\.\w+)*"  # top-level domain (optional, can have > 1)
    r"([\w\-._~/]*)*(?<!\.)"  # path, params, anchors, etc. (optional)
)
cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co.",
]

for c in cases:
    m = regex.match(c)
    # m can be None for strings that do not match at all
    print(c, m is not None and (m.span()[1] - m.span()[0] == len(c)))
A function based on Dominic Tarro’s answer:
import re

def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://"  # protocol
        r"(\w+(-\w+)*\.)?"  # host (optional)
        r"((\w+(-\w+)*)\.(\w+))"  # domain
        r"(\.\w+)*"  # top-level domain (optional, can have > 1)
        r"([\w\-._~/]*)*(?<!\.)"  # path, params, anchors, etc. (optional)
        , x))
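For instance, the function behaves like this on a few inputs (repeated here so the snippet runs standalone):

```python
import re

def is_url(x):
    # protocol is mandatory in this variant, so bare domains fail
    return bool(re.match(
        r"(https?|ftp)://"
        r"(\w+(-\w+)*\.)?"
        r"((\w+(-\w+)*)\.(\w+))"
        r"(\.\w+)*"
        r"([\w\-._~/]*)*(?<!\.)", x))

print(is_url("https://google.com"))           # True
print(is_url("ftp://files.example.org/pub"))  # True
print(is_url("google.com"))                   # False: scheme required here
```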
Pydantic could be used to do that. I’m not very familiar with it, so I can’t speak to its limitations, but it is an option no one has suggested yet.
I have seen many people ask about ftp and file URLs in previous answers, so I recommend getting to know the documentation, as Pydantic has many types for validation, such as FileUrl, AnyUrl, and even database URL types.
A simplistic usage example:
from requests import get, HTTPError, ConnectionError
from pydantic import BaseModel, AnyHttpUrl, ValidationError

class MyConfModel(BaseModel):
    URI: AnyHttpUrl

try:
    myAddress = MyConfModel(URI="http://myurl.com/")
    req = get(myAddress.URI, verify=False)
    print(myAddress.URI)
except ValidationError:
    print('Invalid destination')
Pydantic also raises exceptions (pydantic.ValidationError) that can be used to handle errors.
I have tested it with these patterns:
- http://localhost (pass)
- http://localhost:8080 (pass)
- http://example.com (pass)
- http://user:[email protected] (pass)
- http://_example.com (pass)
- http://&example.com (fails)
- http://-example.com (fails)
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

url = 'http://google.com'
if is_valid_url(url):
    print('Valid URL')
else:
    print('Malformed URL')
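The question's end goal is to reply with the fetched HTML; here is a minimal sketch tying validation and fetching together with the stdlib (fetch_html is a hypothetical helper; the network request only happens for well-formed URLs):

```python
from urllib.parse import urlparse
from urllib.request import urlopen

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def fetch_html(url):
    """Hypothetical helper: return the page HTML, or None for malformed URLs."""
    if not is_valid_url(url):
        return None  # rejected before any network call
    with urlopen(url) as resp:  # network access happens only here
        return resp.read().decode('utf-8', errors='replace')

if __name__ == '__main__':
    print(fetch_html('not a url'))  # None
```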
Use this example to construct your own meaning of a "URL", and apply it everywhere in your code:
# DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# Version 2, December 2004
#
# Copyright (C) 2004 Sam Hocevar <[email protected]>
#
# Everyone is permitted to copy and distribute verbatim or modified
# copies of this license document, and changing it is allowed as long
# as the name is changed.
#
# DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
#
# 0. You just DO WHAT THE FUCK YOU WANT TO.
#
# Copyright © 2023 Anthony [email protected]
#
# This work is free. You can redistribute it and/or modify it under the
# terms of the Do What The Fuck You Want To Public License, Version 2,
# as published by Sam Hocevar. See the LICENSE file for more details.
import operator as op
from urllib.parse import (
    ParseResult,
    urlparse,
)

import attrs
import pytest
from phantom import Phantom
from phantom.fn import compose2

def is_url_address(value: str) -> bool:
    return any(urlparse(value))

class URL(str, Phantom, predicate=is_url_address):
    pass

# presume that an empty URL is a nonsense
def test_empty_url():
    with pytest.raises(TypeError, match="Could not parse .* from ''"):
        URL.parse("")

# is it enough now?
def test_url():
    assert URL.parse("http://")

scheme_and_netloc = op.attrgetter("scheme", "netloc")

def has_scheme_and_netloc(value: ParseResult) -> bool:
    return all(scheme_and_netloc(value))

# need a bit of FP magic here
class ReachableURL(URL, predicate=compose2(has_scheme_and_netloc, urlparse)):
    pass

def test_empty_reachable_url():
    with pytest.raises(TypeError, match="Could not parse .* from ''"):
        ReachableURL.parse("")

# but "empty" for an URL is not just "empty string"
def test_reachable_url_probably_wrong_host():
    assert ReachableURL.parse("http://example")

def test_reachable_url():
    assert ReachableURL.parse("http://example.com")

def test_reachable_url_without_scheme():
    with pytest.raises(TypeError, match="Could not parse .* from 'example.com'"):
        ReachableURL.parse("example.com")

# constructor works too
def test_constructor():
    assert ReachableURL("http://example.com")

# but it *is* `str`
def test_url_is_str():
    assert isinstance(ReachableURL("http://example.com"), str)

# now we can write plain old classes utilizing our `URL` and `ReachableURL`
# I'm lazy...
@attrs.define
class Person:
    homepage: ReachableURL

def test_person():
    person = Person(homepage=ReachableURL("https://example.com/index.html"))
    assert person.homepage

def greet(person: Person) -> None:
    print(f"Hello! I will definitely visit you at {person.homepage}.")

if __name__ == "__main__":
    greet(Person(homepage=ReachableURL.parse("tg://resolve?username")))
It would not be surprising if a URL RFC turned out to be Turing-complete!