How do you validate a URL with a regular expression in Python?
Question:
I’m building an app on Google App Engine. I’m incredibly new to Python and have been beating my head against the following problem for the past 3 days.
I have a class to represent an RSS Feed and in this class I have a method called setUrl. Input to this method is a URL.
I’m trying to use Python’s re module to validate the URL against the regex given in RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt).
Below is a snippet which should work?
p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
m = p.match(url)
if m:
    self.url = url
    return url
Answers:
The regex provided should match any URL of the form http://www.ietf.org/rfc/rfc3986.txt, and it does when tested in the Python interpreter.
What do the URLs you’ve been having trouble parsing look like?
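As a rough illustration of what that regex does, here is a minimal sketch that uses it to split a URI into the five components named in RFC 3986. The helper name split_uri is made up for this example. Note that every part of the pattern is optional, so it is a splitter rather than a validator.

```python
import re

# Appendix B regex from RFC 3986; the "?" that starts the query is escaped.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Split a URI into its components (hypothetical helper)."""
    m = URI_RE.match(uri)
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

parts = split_uri('http://www.ietf.org/rfc/rfc3986.txt')
print(parts)

# Every group is optional, so arbitrary text still produces a match.
print(URI_RE.match('not a url') is not None)
```

Because the match succeeds for arbitrary text, the "if m:" test in the question cannot reject anything by itself.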
import re

urlfinders = [
    re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)[-A-Za-z0-9\.]+)(:[0-9]*)?/[-A-Za-z0-9_\$\.\+\!\*\(\),;:@&=\?/~\#\%]*[^]'\.}>\),\"]"),
    re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)[-A-Za-z0-9\.]+)(:[0-9]*)?"),
    re.compile(r"(~/|/|\./)([-A-Za-z0-9_\$\.\+\!\*\(\),;:@&=\?/~\#\%]|\\)+"),
    re.compile(r"'\<((mailto:)|)[-A-Za-z0-9\.]+@[-A-Za-z0-9\.]+"),
]
NOTE: As ugly as it looks in your browser, just copy and paste it and the formatting should be fine.
Found on the Python mailing list; it is used in gnome-terminal.
source: http://mail.python.org/pipermail/python-list/2007-January/595436.html
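A minimal sketch of how one of these patterns might be used to pull URLs out of free text. It uses only the second pattern from the list, written as a raw string; the helper name find_urls and the sample sentences are assumptions for illustration.

```python
import re

# Second pattern from the list above, as a raw string.
# "nttp" is kept exactly as it appears in the quoted source.
HOST_URL_RE = re.compile(
    r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
    r"|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)"
    r"[-A-Za-z0-9\.]+)(:[0-9]*)?"
)

def find_urls(text):
    """Return every substring the pattern recognises (hypothetical helper)."""
    # The pattern contains capture groups, so findall() would return tuples;
    # finditer() with group(0) yields the whole match instead.
    return [m.group(0) for m in HOST_URL_RE.finditer(text)]

print(find_urls('see http://www.example.com now'))
print(find_urls('visit www.python.org:8080 today'))
```

These patterns are meant for spotting URL-like text (as a terminal does when making links clickable), which is a looser job than validating a single input string.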
An easy way to parse (and validate) URLs is the urlparse module (urllib.parse in Python 3).
A regex is too much work.
There’s no “validate” method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
Check the RFC carefully and see if you can construct an “invalid” URL. The rules are very flexible.
For example, ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.
Also, ///// is a valid URL. The netloc (“hostname”) is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is the equivalent.
Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
Bottom Line. Parse it, and look at the pieces to see if they’re displeasing in some way.
Do you want the scheme to always be “http”? Do you want the netloc to always be “www.somename.somedomain”? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?
These are not RFC-specified validations. These are validations unique to your application.
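A minimal sketch of "parse it and look at the pieces", assuming Python 3's urllib.parse (the urlparse module in Python 2). The sample strings are the odd URLs discussed above; the helper name show_pieces is made up for illustration.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def show_pieces(url):
    """Return the parts of a URL an application would typically inspect."""
    pieces = urlparse(url)
    return {
        'scheme': pieces.scheme,
        'netloc': pieces.netloc,
        'path': pieces.path,
        'query': pieces.query,
    }

for sample in [':::::', '/////', 'bad://///worse/////']:
    # None of these raise an error; deciding whether the pieces are
    # acceptable is left to the application.
    print(sample, show_pieces(sample))
```

The parser reports what it found without complaint, so any rejection has to come from rules the application applies to the scheme, netloc, and path afterwards.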
I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:
import string
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

pieces = urlparse(url)
assert all([pieces.scheme, pieces.netloc])
assert set(pieces.netloc) <= set(string.ascii_letters + string.digits + '-.')  # and others?
assert pieces.scheme in ['http', 'https', 'ftp']  # etc.
It might be slower, and maybe you’ll miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.
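The same checks could be wrapped in a function that returns True or False instead of raising AssertionError, which may be more convenient inside something like setUrl. This is only a sketch: the function name, the scheme list, and the allowed host characters are assumptions that each application would adjust.

```python
import string
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Assumptions for illustration: adjust both sets to the application's needs.
ALLOWED_SCHEMES = {'http', 'https', 'ftp'}
ALLOWED_HOST_CHARS = set(string.ascii_letters + string.digits + '-.')

def looks_acceptable(url):
    """Return True if url passes the application-specific checks above."""
    try:
        pieces = urlparse(url)
        host = pieces.hostname  # netloc without userinfo and port
    except ValueError:
        # urlparse rejects a few malformed inputs, e.g. unbalanced "[" in the host
        return False
    if pieces.scheme not in ALLOWED_SCHEMES:
        return False
    if not host:
        return False
    return set(host) <= ALLOWED_HOST_CHARS

print(looks_acceptable('http://www.example.com/feed.xml'))
print(looks_acceptable('javascript:alert(1)'))
```

Using hostname rather than netloc means a port or a user:password prefix does not trip the character check.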
I’ve needed to do this many times over the years and always end up copying someone else’s regular expression who has thought about it way more than I want to think about it.
Having said that, there is a regex in the Django forms code which should do the trick:
http://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L534
Here’s the complete regexp to parse a URL.
(?:https?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?)
.)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d
+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))*)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))?)?)|(?:s?ftp://(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),
]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:
%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Z
d]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(
?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))(?:/(?:(?:(?:(?:[a-zA-Zd$-
_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news
:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;/?:&=])+@(?:
(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3})))|(?:[a-zA
-Z](?:[a-zA-Zd]|[_.+-])*)|*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Zd](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA
-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:[a-zA-Z](?:[a-
zA-Zd]|[_.+-])*)(?:/(?:d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(
?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-
Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))/?)|(?:gopher://(?
:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z
](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?
:d+))?)(?:/(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))(?:(?:
(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*)(?:%09(?:(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;:@&=])*)(?:%09(?:(?:[a-zA-
Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*))?)?)?)?)|(?:wais://(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?)/(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)(?:(?:/(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)/(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))*))|?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-
Fd]{2}))|[;:@&=])*))?)|(?:mailto:(?:(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|
(?:%[a-fA-Fd]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-
Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))
|(?:(?:d+)(?:.(?:d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-
zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd
]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:(?:(
?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?
:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:(?:;(?:(?:
(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)=(?:(?:(?:[a-zA
-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)))*)|(?:ldap://(?:(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aa
d]))|(?:%20))+|(?:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(
?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-
fA-Fd]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(
?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?
:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[
Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*)
(?:(?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:
(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:O
ID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa
])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*))(?:(?
:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Zd
]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:OID|oid).(?:(?:d+
)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(
?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*))*(?:(?:(?:%0[Aa])?(
?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:?(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:,(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%
[a-fA-Fd]{2}))+))*)?)(?:?(?:base|one|sub)(?:?(?:((?:[a-zA-Zd$-_.+
!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))+)))?)?)?)|(?:(?:z39.50[rs])://(?:(
?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](
?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:
d+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:+(?
:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))*(?:?(?:(?:[a-zA-Zd
$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Zd$-_.+!*
'(),]|(?:%[a-fA-Fd]{2}))+))?(?:;rs=(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[
a-fA-Fd]{2}))+)(?:+(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+
))*)?))|(?:cid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?
:@&=])*))|(?:mid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[
;?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?:
@&=])*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-
zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)
(?:.(?:d+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&=])*)(?:(?:;(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&])*)=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[/?:@&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:*|
(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+))))?)|(?:(
?:;[Aa][Uu][Tt][Hh]=(?:*|(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-F
d]{2}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~])+))?))@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Z
d])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:
.(?:d+)){3}))(?::(?:d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]
|(?:%[a-fA-Fd]{2}))|[&=~:@/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][
Tt]|[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{
2}))|[&=~:@/])+)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~:@/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?
:[1-9]d*)))?)|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|
[&=~:@/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9
]d*)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo
][Nn]=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~:@/])+))
)?)))?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a
-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+
)(?:.(?:d+)){3}))(?::(?:d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Zd$-_.
!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Zd$-_.!~*
'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z
d$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?))|(?:(?:(?:(?:(?:[a
-zA-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))
Given its complexity, I think you should go the urlparse way.
For completeness, here’s the pseudo-BNF of the above regex (as documentation):
; The generic form of a URL is:
genericurl = scheme ":" schemepart
; Specific predefined schemes are defined here; new schemes
; may be registered with IANA
url = httpurl | ftpurl | newsurl |
nntpurl | telneturl | gopherurl |
waisurl | mailtourl | fileurl |
prosperourl | otherurl
; new schemes follow the general syntax
otherurl = genericurl
; the scheme is in lower case; interpreters should use case-ignore
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
schemepart = *xchar | ip-schemepart
; URL schemeparts for ip based protocols:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1
; The predefined schemes:
; FTP (see also RFC959)
ftpurl = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath = fsegment *[ "/" fsegment ]
fsegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype = "A" | "I" | "D" | "a" | "i" | "d"
; FILE
fileurl = "file://" [ host | "localhost" ] "/" fpath
; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
; GOPHER (see also RFC1436)
gopherurl = "gopher://" hostport [ / [ gtype [ selector
[ "%09" search [ "%09" gopher+_string ] ] ] ] ]
gtype = xchar
selector = *xchar
gopher+_string = *xchar
; MAILTO (see also RFC822)
mailtourl = "mailto:" encoded822addr
encoded822addr = 1*xchar ; further defined in RFC822
; NEWS (see also RFC1036)
newsurl = "news:" grouppart
grouppart = "*" | group | article
group = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host
; NNTP (see also RFC977)
nntpurl = "nntp://" hostport "/" group [ "/" digits ]
; TELNET
telneturl = "telnet://" login [ "/" ]
; WAIS (see also RFC1625)
waisurl = waisdatabase | waisindex | waisdoc
waisdatabase = "wais://" hostport "/" database
waisindex = "wais://" hostport "/" database "?" search
waisdoc = "wais://" hostport "/" database "/" wtype "/" wpath
database = *uchar
wtype = *uchar
wpath = *uchar
; PROSPERO
prosperourl = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath = psegment *[ "/" psegment ]
psegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec = ";" fieldname "=" fieldvalue
fieldname = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue = *[ uchar | "?" | ":" | "@" | "&" ]
; Miscellaneous definitions
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
"i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit
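To show how the grammar maps onto the regex, here is a minimal sketch that translates only the hostport rules (alphadigit, domainlabel, toplabel, hostname, hostnumber, port) into Python regex fragments and composes them. The constant and function names are made up for this example, and it covers just this small part of the BNF.

```python
import re

# Each constant mirrors one BNF rule from the grammar above.
ALPHADIGIT = r'[a-zA-Z\d]'
# domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
DOMAINLABEL = rf'(?:{ALPHADIGIT}(?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?)'
# toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
TOPLABEL = rf'(?:[a-zA-Z](?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?)'
# hostname = *[ domainlabel "." ] toplabel
HOSTNAME = rf'(?:(?:{DOMAINLABEL}\.)*{TOPLABEL})'
# hostnumber = digits "." digits "." digits "." digits
HOSTNUMBER = r'(?:\d+(?:\.\d+){3})'
# hostport = host [ ":" port ]
HOSTPORT = rf'(?:{HOSTNAME}|{HOSTNUMBER})(?::\d+)?'

HOSTPORT_RE = re.compile(HOSTPORT)

def is_hostport(text):
    """Return True if the whole string matches the hostport rule."""
    return HOSTPORT_RE.fullmatch(text) is not None

for sample in ['www.ietf.org', 'www.ietf.org:8080', '127.0.0.1', '----']:
    print(sample, is_hostport(sample))
```

Building the pattern from named fragments keeps each rule checkable against the grammar, which is much harder to do with the fully expanded expression.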
urlparse quite happily takes invalid URLs; it is more a string-splitting library than any kind of validator. For example:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
urlparse('http://----')
# returns: ParseResult(scheme='http', netloc='----', path='', params='', query='', fragment='')
Depending on the situation, this might be fine.
If you mostly trust the data and just want to verify that the protocol is HTTP, then urlparse is perfect.
If you want to make sure the URL is actually a legal URL, use the ridiculous regex.
If you want to make sure it’s a real web address, try to open it:
from urllib.request import urlopen  # Python 2: urllib.urlopen

try:
    urlopen(url)
except (IOError, ValueError):
    print("Not a real URL")
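A slightly more defensive sketch of the same idea, assuming Python 3. The function name is_reachable and the 5-second default timeout are assumptions. It restricts the scheme first, because urlopen will also open file: and ftp: URLs, and it makes a real network request, so it is slow compared with the other approaches.

```python
import http.client
from urllib.parse import urlparse
from urllib.request import urlopen

def is_reachable(url, timeout=5):
    """Return True if an HTTP(S) request to url succeeds, else False."""
    try:
        # urlopen also handles file: and ftp:, so check the scheme first.
        if urlparse(url).scheme not in ('http', 'https'):
            return False
        with urlopen(url, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # strings urllib cannot interpret as a URL at all.
        return False

print(is_reachable('not a url'))
```

A False result only means the request failed at that moment, so a temporarily unavailable feed would be rejected too.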
note – Lepl is no longer maintained or supported.
RFC 3696 defines “best practices” for URL validation – http://www.faqs.org/rfcs/rfc3696.html
The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like:
from lepl.apps.rfc3696 import Email, HttpUrl

# compile the validators (do once at start of program)
valid_email = Email()
valid_http_url = HttpUrl()

# use the validators (as often as you like)
if valid_email(some_email):
    pass  # email is ok
else:
    pass  # email is bad

if valid_http_url(some_url):
    pass  # url is ok
else:
    pass  # url is bad
Although the validators are defined in Lepl, which is a recursive descent parser, they are largely compiled internally to regular expressions. That combines the best of both worlds – a (relatively) easy to read definition that can be checked against RFC 3696 and an efficient implementation. There’s a post on my blog showing how this simplifies the parser – http://www.acooke.org/cute/LEPLOptimi0.html
Lepl is available at http://www.acooke.org/lepl and the RFC 3696 module is documented at http://www.acooke.org/lepl/rfc3696.html
This is completely new in this release, so may contain bugs. Please contact me if you have any problems and I will fix them ASAP. Thanks.
I’m using the one used by Django and it seems to work pretty well:
def is_valid_url(url):
    import re
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return url is not None and regex.search(url)
You can always check the latest version here: https://github.com/django/django/blob/master/django/core/validators.py#L74
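To get a feel for what this pattern accepts and rejects, here is a small sketch that compiles it once at module level and runs it over a few sample inputs. The pattern is the one from this answer with its regex escapes (\. \d \S) written out; the sample URLs are assumptions chosen for illustration.

```python
import re

# Pattern from the answer above, compiled once instead of on every call.
URL_RE = re.compile(
    r'^https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain
    r'localhost|'  # localhost
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # or something shaped like an IPv4 address
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def is_valid_url(url):
    """Return True if url matches the pattern, else False."""
    return url is not None and URL_RE.search(url) is not None

samples = [
    'http://www.example.com/feed.xml',
    'http://localhost:8000/',
    'http://----',
    'ftp://example.com/',
    'http://999.999.999.999',
    'http://example.photography',
]
for sample in samples:
    print(sample, is_valid_url(sample))
```

Two limits are worth knowing: the IP branch only checks the shape, so 999.999.999.999 passes, and the top-level domain is capped at six letters, so longer ones such as .photography are rejected.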
http://pypi.python.org/pypi/rfc3987 gives regular expressions for consistency with the rules in RFC 3986 and RFC 3987 (that is, not with scheme-specific rules).
A regexp for IRI_reference is:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[
a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU0002
0000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU
00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009ff
fdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U00
0dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>
\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4]
[0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0
-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]
?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(
?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|
[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F
]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?
:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,
4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0
-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-
9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]
|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|
(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6
}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(
?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][
0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-
U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU000500
00-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00
090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffd
U000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(
?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-uf
dcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd
U00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007f
ffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U0
00bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-
F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7
ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000
-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU0007
0000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU
000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000eff
fd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ff
uf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-
U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU000700
00-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU00
0b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd
])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[
xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU
00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006ff
fdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U00
0afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-
U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa
0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery
>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U000
1fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-
U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU000900
00-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU00
0d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[
ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifra
gment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-
U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050
000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU0
0090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfff
dU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|
@)|/|\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa
0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,
4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4
}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\
.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1
,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-
4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?
:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A
-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][
0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1
,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]
?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-
9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}
:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|
v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-
9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA
-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000
-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU0006
0000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU
000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dff
fdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*)
)?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU0
0010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fff
dU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U000
8fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-
U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*
+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufd
f0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU000400
00-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00
080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffd
U000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A
-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0
-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000
-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU0008
0000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU
000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F
]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-u
fdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd
U00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007
fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U
000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A
-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcf
ufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00
040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffd
U00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000b
fffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][
0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9.
_~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U000
2fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-
U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a00
00-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU00
0e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f000
0-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-
Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-
U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060
000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU0
00a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfff
dU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?)
In one line:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040
000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000af
ffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0
-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf
0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?)
Nowadays, in 90% of cases, if you are working with URLs in Python you are probably already using python-requests. Hence the question: why not reuse the URL validation from requests?
from requests.models import PreparedRequest
import requests.exceptions

def check_url(url):
    prepared_request = PreparedRequest()
    try:
        prepared_request.prepare_url(url, None)
        return prepared_request.url
    except requests.exceptions.MissingSchema as e:
        # Substitute whatever exception type suits your application
        raise ValueError("Invalid URL: %r" % url) from e
Features:
- Don’t reinvent the wheel
- DRY
- Works offline
- Minimal resource usage
Modified Django URL validation regex:
import re
ul = '\u00a1-\uffff'  # unicode letters range (must not be a raw string)
# IP patterns
ipv4_re = r'(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
ipv6_re = r'\[[0-9a-f:\.]+\]'
# Host patterns
hostname_re = r'[a-z' + ul + r'0-9](?:[a-z' + ul + r'0-9-]{0,61}[a-z' + ul + r'0-9])?'
domain_re = r'(?:\.(?!-)[a-z' + ul + r'0-9-]{1,63}(?<!-))*'  # domain name labels have a max length of 63 characters
tld_re = (
    r'\.'  # dot
    r'(?!-)'  # can't start with a dash
    r'(?:[a-z' + ul + '-]{2,63}'  # domain label
    r'|xn--[a-z0-9]{1,59})'  # or punycode label
    r'(?<!-)'  # can't end with a dash
    r'\.?'  # may have a trailing dot
)
host_re = '(' + hostname_re + domain_re + tld_re + '|localhost)'
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http(s):// or ftp(s)://
    r'(?:\S+(?::\S*)?@)?'  # user:pass authentication
    r'(?:' + ipv4_re + '|' + ipv6_re + '|' + host_re + ')'  # localhost or ip
    r'(?::\d{2,5})?'  # optional port
    r'(?:[/?#][^\s]*)?'  # resource path
    r'\Z', re.IGNORECASE)
source: https://github.com/django/django/blob/master/django/core/validators.py#L74
I’m building an app on Google App Engine. I’m incredibly new to Python and have been beating my head against the following problem for the past 3 days.
I have a class to represent an RSS Feed and in this class I have a method called setUrl. Input to this method is a URL.
I’m trying to use the re python module to validate off of the RFC 3986 Reg-ex (http://www.ietf.org/rfc/rfc3986.txt)
Below is a snippet which should work?
p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
m = p.match(url)
if m:
    self.url = url
    return url
The regex provided should match any URL of the form http://www.ietf.org/rfc/rfc3986.txt, and it does when tested in the Python interpreter.
What format were the URLs you’ve been having trouble parsing?
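To see why this pattern accepts nearly anything, it can help to look at what it captures. The following is a minimal sketch, assuming Python 3; the helper name split_uri is made up for illustration, and the group numbers follow RFC 3986, Appendix B.

```python
import re

# Regex from RFC 3986, Appendix B. It splits a URI reference into
# components; it does not reject malformed input.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_uri(text):
    """Return the five components captured by the Appendix B pattern."""
    m = URI_RE.match(text)
    # Every group is optional or may match the empty string, so match()
    # succeeds for any input string and m is never None here.
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }


print(split_uri('http://www.ietf.org/rfc/rfc3986.txt'))
print(split_uri(':::::'))
```

Because the match never fails, a bare "if m:" test cannot serve as validation; any checking has to be done on the captured pieces.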
import re

# Raw strings keep the regex backslashes intact
urlfinders = [
    re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)[-A-Za-z0-9\.]+)(:[0-9]*)?/[-A-Za-z0-9_\$\.\+\!\*\(\),;:@&=\?/~\#\%]*[^]'\.}>\),\"]"),
    re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)[-A-Za-z0-9\.]+)(:[0-9]*)?"),
    re.compile(r"(~/|/|\./)([-A-Za-z0-9_\$\.\+\!\*\(\),;:@&=\?/~\#\%]|\\)+"),
    re.compile(r"'\<((mailto:)|)[-A-Za-z0-9\.]+@[-A-Za-z0-9\.]+"),
]
NOTE: As ugly as it looks in your browser, just copy and paste it and the formatting should be fine.
Found on the Python mailing list and used for gnome-terminal.
source: http://mail.python.org/pipermail/python-list/2007-January/595436.html
An easy way to parse (and validate) URLs is the urlparse module (urlparse in Python 2, urllib.parse in Python 3).
A regex is too much work.
There’s no “validate” method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
Check the RFC carefully and see if you can construct an “invalid” URL. The rules are very flexible.
For example, ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.
Also, ///// is a valid URL. The netloc (“hostname”) is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is the equivalent.
Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
Bottom Line. Parse it, and look at the pieces to see if they’re displeasing in some way.
Do you want the scheme to always be “http”? Do you want the netloc to always be “www.somename.somedomain”? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?
These are not RFC-specified validations. These are validations unique to your application.
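The odd-looking examples above can be checked directly with the standard library. This is a minimal sketch, assuming Python 3 (where the module is urllib.parse); the helper name describe is hypothetical.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL and return the pieces an application might inspect."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path)


for candidate in (':::::', '/////', 'bad://///worse/////'):
    scheme, netloc, path = describe(candidate)
    print('%-22r scheme=%r netloc=%r path=%r' % (candidate, scheme, netloc, path))
```

None of these inputs raises an error; the parser only splits on punctuation, which is why any acceptance rules have to be applied to the pieces afterwards.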
I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:
import string
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

pieces = urlparse(url)
assert all([pieces.scheme, pieces.netloc])
assert set(pieces.netloc) <= set(string.ascii_letters + string.digits + '-.')  # and others?
assert pieces.scheme in ['http', 'https', 'ftp']  # etc.
It might be slower, and maybe you’ll miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.
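The same checks can be wrapped in a function that returns a boolean instead of asserting, which is usually easier to call from a setter. This is only a sketch, assuming Python 3; the function name looks_acceptable, the allowed scheme list, and the decision to permit ':' (for ports) in the netloc are illustrative assumptions, not part of the answer above.

```python
import string
from urllib.parse import urlparse

ALLOWED_SCHEMES = {'http', 'https', 'ftp'}  # application-specific choice
# ':' is added here so that host:port passes; adjust to taste.
ALLOWED_NETLOC_CHARS = set(string.ascii_letters + string.digits + '-.:')


def looks_acceptable(url):
    """Return True if url passes a few application-level checks."""
    try:
        pieces = urlparse(url)
    except ValueError:  # raised for some malformed inputs, e.g. 'http://[::1'
        return False
    if pieces.scheme not in ALLOWED_SCHEMES:
        return False
    if not pieces.netloc:
        return False
    return set(pieces.netloc) <= ALLOWED_NETLOC_CHARS


print(looks_acceptable('http://www.ietf.org/rfc/rfc3986.txt'))
print(looks_acceptable('bad://///worse/////'))
```

A character-set test like this says nothing about label structure, so a string such as 'http://----' still passes; stricter rules would need to be added for that.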
I’ve needed to do this many times over the years and always end up copying someone else’s regular expression who has thought about it way more than I want to think about it.
Having said that, there is a regex in the Django forms code which should do the trick:
http://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L534
Here’s the complete regexp to parse a URL.
(?:https?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?)
.)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d
+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))*)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))?)?)|(?:s?ftp://(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),
]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:
%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Z
d]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(
?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))(?:/(?:(?:(?:(?:[a-zA-Zd$-
_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news
:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;/?:&=])+@(?:
(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3})))|(?:[a-zA
-Z](?:[a-zA-Zd]|[_.+-])*)|*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Zd](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA
-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:[a-zA-Z](?:[a-
zA-Zd]|[_.+-])*)(?:/(?:d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(
?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-
Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))/?)|(?:gopher://(?
:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z
](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?
:d+))?)(?:/(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))(?:(?:
(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*)(?:%09(?:(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;:@&=])*)(?:%09(?:(?:[a-zA-
Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*))?)?)?)?)|(?:wais://(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?)/(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)(?:(?:/(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)/(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))*))|?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-
Fd]{2}))|[;:@&=])*))?)|(?:mailto:(?:(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|
(?:%[a-fA-Fd]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-
Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))
|(?:(?:d+)(?:.(?:d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-
zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd
]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:(?:(
?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?
:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:(?:;(?:(?:
(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)=(?:(?:(?:[a-zA
-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)))*)|(?:ldap://(?:(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aa
d]))|(?:%20))+|(?:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(
?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-
fA-Fd]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(
?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?
:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[
Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*)
(?:(?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:
(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:O
ID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa
])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*))(?:(?
:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Zd
]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:OID|oid).(?:(?:d+
)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(
?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*))*(?:(?:(?:%0[Aa])?(
?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:?(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:,(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%
[a-fA-Fd]{2}))+))*)?)(?:?(?:base|one|sub)(?:?(?:((?:[a-zA-Zd$-_.+
!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))+)))?)?)?)|(?:(?:z39.50[rs])://(?:(
?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](
?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:
d+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:+(?
:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))*(?:?(?:(?:[a-zA-Zd
$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Zd$-_.+!*
'(),]|(?:%[a-fA-Fd]{2}))+))?(?:;rs=(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[
a-fA-Fd]{2}))+)(?:+(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+
))*)?))|(?:cid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?
:@&=])*))|(?:mid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[
;?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?:
@&=])*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-
zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)
(?:.(?:d+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&=])*)(?:(?:;(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&])*)=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[/?:@&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd$
-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:*|
(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+))))?)|(?:(
?:;[Aa][Uu][Tt][Hh]=(?:*|(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-F
d]{2}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~])+))?))@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Z
d])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:
.(?:d+)){3}))(?::(?:d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]
|(?:%[a-fA-Fd]{2}))|[&=~:@/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][
Tt]|[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{
2}))|[&=~:@/])+)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~:@/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?
:[1-9]d*)))?)|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|
[&=~:@/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9
]d*)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo
][Nn]=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~:@/])+))
)?)))?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a
-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+
)(?:.(?:d+)){3}))(?::(?:d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Zd$-_.
!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Zd$-_.!~*
'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z
d$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?))|(?:(?:(?:(?:(?:[a
-zA-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))
Given its complexity, I think you should go the urlparse way.
For completeness, here’s the pseudo-BNF of the above regex (as documentation):
; The generic form of a URL is:
genericurl = scheme ":" schemepart
; Specific predefined schemes are defined here; new schemes
; may be registered with IANA
url = httpurl | ftpurl | newsurl | nntpurl | telneturl | gopherurl | waisurl | mailtourl | fileurl | prosperourl | otherurl
; new schemes follow the general syntax
otherurl = genericurl
; the scheme is in lower case; interpreters should use case-ignore
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
schemepart = *xchar | ip-schemepart
; URL schemeparts for ip based protocols:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1
; The predefined schemes:
; FTP (see also RFC959)
ftpurl = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath = fsegment *[ "/" fsegment ]
fsegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype = "A" | "I" | "D" | "a" | "i" | "d"
; FILE
fileurl = "file://" [ host | "localhost" ] "/" fpath
; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
; GOPHER (see also RFC1436)
gopherurl = "gopher://" hostport [ / [ gtype [ selector [ "%09" search [ "%09" gopher+_string ] ] ] ] ]
gtype = xchar
selector = *xchar
gopher+_string = *xchar
; MAILTO (see also RFC822)
mailtourl = "mailto:" encoded822addr
encoded822addr = 1*xchar ; further defined in RFC822
; NEWS (see also RFC1036)
newsurl = "news:" grouppart
grouppart = "*" | group | article
group = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host
; NNTP (see also RFC977)
nntpurl = "nntp://" hostport "/" group [ "/" digits ]
; TELNET
telneturl = "telnet://" login [ "/" ]
; WAIS (see also RFC1625)
waisurl = waisdatabase | waisindex | waisdoc
waisdatabase = "wais://" hostport "/" database
waisindex = "wais://" hostport "/" database "?" search
waisdoc = "wais://" hostport "/" database "/" wtype "/" wpath
database = *uchar
wtype = *uchar
wpath = *uchar
; PROSPERO
prosperourl = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath = psegment *[ "/" psegment ]
psegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec = ";" fieldname "=" fieldvalue
fieldname = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue = *[ uchar | "?" | ":" | "@" | "&" ]
; Miscellaneous definitions
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit
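To illustrate how a grammar like this turns into a regular expression, here is a small sketch that builds only the hostport rule from its sub-rules, one Python variable per BNF production. It assumes Python 3.6+ (for f-strings); the function name is_hostport is hypothetical, and this fragment is not the full regex shown above.

```python
import re

# Each variable mirrors one production of the pseudo-BNF.
alphadigit = r"[a-zA-Z\d]"
domainlabel = rf"(?:{alphadigit}(?:(?:{alphadigit}|-)*{alphadigit})?)"
toplabel = rf"(?:[a-zA-Z](?:(?:{alphadigit}|-)*{alphadigit})?)"
hostname = rf"(?:(?:{domainlabel}\.)*{toplabel})"
hostnumber = r"(?:\d+(?:\.\d+){3})"  # digits "." digits "." digits "." digits
host = rf"(?:{hostname}|{hostnumber})"
hostport = rf"(?:{host}(?::\d+)?)"   # host [ ":" port ]

HOSTPORT_RE = re.compile(hostport)


def is_hostport(text):
    """Return True if text matches the hostport production in full."""
    return HOSTPORT_RE.fullmatch(text) is not None


for sample in ('www.ietf.org', 'www.ietf.org:80', '10.0.0.1', '-bad.example'):
    print(sample, is_hostport(sample))
```

Composing the pattern from named pieces keeps each rule readable and testable on its own. Note that the grammar itself is loose in places: hostnumber puts no upper bound on each number, so a string like 999.999.999.999 is accepted.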
urlparse quite happily takes invalid URLs; it is more a string-splitting library than any kind of validator. For example:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
urlparse('http://----')
# returns: ParseResult(scheme='http', netloc='----', path='', params='', query='', fragment='')
Depending on the situation, this might be fine.
If you mostly trust the data, and just want to verify the protocol is HTTP, then urlparse is perfect.
If you want to make sure the URL is actually a legal URL, use the ridiculous regex.
If you want to make sure it’s a real web address:
import urllib.request  # Python 2: import urllib and call urllib.urlopen(url)
try:
    urllib.request.urlopen(url)
except (IOError, ValueError):  # ValueError covers strings with no usable scheme
    print("Not a real URL")
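Between trusting urlparse and using the full regex there is a middle ground: split the URL, then apply a couple of structural checks to the host. The sketch below assumes Python 3; the function name is_plausible_http_url and the label rule (letters, digits and hyphens, 1 to 63 characters, no leading or trailing hyphen) are choices made for this illustration. It performs no network access and, as written, rejects IPv6 literals.

```python
import re
from urllib.parse import urlsplit

# One host label: letters, digits, hyphens; no leading or trailing hyphen.
LABEL_RE = re.compile(r'(?!-)[A-Za-z0-9-]{1,63}(?<!-)')


def is_plausible_http_url(url):
    """Return True for http(s) URLs whose host is made of well-formed labels."""
    try:
        parts = urlsplit(url)
        hostname = parts.hostname
    except ValueError:
        return False
    if parts.scheme not in ('http', 'https') or not hostname:
        return False
    if hostname.endswith('.'):  # allow a single trailing dot
        hostname = hostname[:-1]
    return all(LABEL_RE.fullmatch(label) for label in hostname.split('.'))


print(is_plausible_http_url('http://www.ietf.org/rfc/rfc3986.txt'))
print(is_plausible_http_url('http://----'))
```

This catches the 'http://----' case shown above while staying far shorter than the full regex; whether that trade-off is acceptable depends on how much the input is trusted.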
note – Lepl is no longer maintained or supported.
RFC 3696 defines “best practices” for URL validation – http://www.faqs.org/rfcs/rfc3696.html
The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like:
from lepl.apps.rfc3696 import Email, HttpUrl
# compile the validators (do once at start of program)
valid_email = Email()
valid_http_url = HttpUrl()
# use the validators (as often as you like)
if valid_email(some_email):
    pass  # email is ok
else:
    pass  # email is bad
if valid_http_url(some_url):
    pass  # url is ok
else:
    pass  # url is bad
Although the validators are defined in Lepl, which is a recursive descent parser, they are largely compiled internally to regular expressions. That combines the best of both worlds – a (relatively) easy to read definition that can be checked against RFC 3696 and an efficient implementation. There’s a post on my blog showing how this simplifies the parser – http://www.acooke.org/cute/LEPLOptimi0.html
Lepl is available at http://www.acooke.org/lepl and the RFC 3696 module is documented at http://www.acooke.org/lepl/rfc3696.html
This is completely new in this release, so may contain bugs. Please contact me if you have any problems and I will fix them ASAP. Thanks.
I’m using the one used by Django and it seems to work pretty well:
def is_valid_url(url):
    import re
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return url is not None and regex.search(url)
You can always check the latest version here: https://github.com/django/django/blob/master/django/core/validators.py#L74
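If the check is called often, one possible variation is to compile the pattern once at import time and return a strict boolean. This is a sketch using the same older pattern quoted above, assuming Python 3; the module-level name URL_RE is made up for the example.

```python
import re

# Compiled once at import time instead of on every call.
URL_RE = re.compile(
    r'^https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def is_valid_url(url):
    """Return True or False rather than a match object or None."""
    return isinstance(url, str) and URL_RE.search(url) is not None


for sample in ('http://www.ietf.org/rfc/rfc3986.txt',
               'http://localhost:8000/',
               'http://----',
               None):
    print(repr(sample), is_valid_url(sample))
```

One limitation worth knowing: the {2,6} bound on the top-level domain in this older pattern rejects longer TLDs such as .photography, which is one reason to prefer the current validator linked above.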
http://pypi.python.org/pypi/rfc3987 gives regular expressions for consistency with the rules in RFC 3986 and RFC 3987 (that is, not with scheme-specific rules).
A regexp for IRI_reference is:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[
a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU0002
0000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU
00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009ff
fdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U00
0dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>
[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4]
[0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0
-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]
?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(
?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|
[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F
]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?
:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,
4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0
-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-
9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]
|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|
(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6
}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(
?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][
0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-
U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU000500
00-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00
090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffd
U000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(
?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-uf
dcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd
U00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007f
ffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U0
00bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-
F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7
ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000
-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU0007
0000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU
000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000eff
fd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ff
uf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-
U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU000700
00-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU00
0b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd
])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[
xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU
00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006ff
fdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U00
0afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-
U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa
0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery
>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U000
1fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-
U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU000900
00-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU00
0d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[
ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifra
gment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-
U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050
000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU0
0090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfff
dU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|
@)|/|\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa
0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,
4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4
}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\
.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1
,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-
4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?
:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A
-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][
0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1
,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]
?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-
9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}
:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|
v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-
9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA
-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000
-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU0006
0000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU
000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dff
fdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*)
)?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU0
0010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fff
dU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U000
8fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-
U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*
+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufd
f0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU000400
00-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00
080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffd
U000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A
-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0
-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000
-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU0008
0000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU
000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F
]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-u
fdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd
U00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007
fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U
000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A
-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcf
ufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00
040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffd
U00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000b
fffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][
0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9.
_~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U000
2fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-
U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a00
00-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU00
0e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f000
0-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-
Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-
U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060
000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU0
00a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfff
dU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?)
In one line:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040
000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000af
ffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0
-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf
0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[ue000-uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\?)*))?(?:\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-ud7ffuf900-ufdcfufdf0-uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\?)*))?)
Nowadays, in about 90% of cases, if you are working with URLs in Python you are probably already using python-requests. So the question is: why not reuse the URL validation from requests?
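from requests.models import PreparedRequest
import requests.exceptions

def check_url(url):
    prepared_request = PreparedRequest()
    try:
        # prepare_url() parses and normalizes the URL without sending a request
        prepared_request.prepare_url(url, None)
        return prepared_request.url
    except requests.exceptions.MissingSchema as e:
        # SomeException is a placeholder: substitute your own exception type.
        # Note that prepare_url() can also raise requests.exceptions.InvalidURL.
        raise SomeException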
Features:
- Don’t reinvent the wheel
- DRY
- Works offline
- Minimal resource usage
A modified version of Django's URL validation regex:
import re

ul = '\u00a1-\uffff'  # unicode letters range (must not be a raw string)

# IP patterns
ipv4_re = r'(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
ipv6_re = r'\[[0-9a-f:\.]+\]'

# Host patterns
hostname_re = r'[a-z' + ul + r'0-9](?:[a-z' + ul + r'0-9-]{0,61}[a-z' + ul + r'0-9])?'
domain_re = r'(?:\.(?!-)[a-z' + ul + r'0-9-]{1,63}(?<!-))*'  # domain name labels have a max length of 63 characters
tld_re = (
    r'\.'                          # dot
    r'(?!-)'                       # can't start with a dash
    r'(?:[a-z' + ul + '-]{2,63}'   # domain label
    r'|xn--[a-z0-9]{1,59})'        # or punycode label
    r'(?<!-)'                      # can't end with a dash
    r'\.?'                         # may have a trailing dot
)
host_re = '(' + hostname_re + domain_re + tld_re + '|localhost)'

regex = re.compile(
    r'^(?:http|ftp)s?://'  # http(s):// or ftp(s)://
    r'(?:\S+(?::\S*)?@)?'  # user:pass authentication
    r'(?:' + ipv4_re + '|' + ipv6_re + '|' + host_re + ')'  # localhost or ip
    r'(?::\d{2,5})?'  # optional port
    r'(?:[/?#][^\s]*)?'  # resource path
    r'\Z', re.IGNORECASE)
source: https://github.com/django/django/blob/master/django/core/validators.py#L74
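To see how a pattern like this behaves in practice, here is a minimal sketch that rebuilds the same regex in compact form and wraps it in a helper. The names URL_RE and is_valid_url are made up for this illustration and are not part of Django, and the sketch assumes Python 3, where '\u00a1-\uffff' in a normal (non-raw) string is a Unicode range.

```python
import re

# Same building blocks as the Django-derived pattern above, written compactly.
ul = '\u00a1-\uffff'  # unicode letters range (non-raw string, Python 3 assumed)
ipv4_re = r'(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}'
ipv6_re = r'\[[0-9a-f:\.]+\]'
hostname_re = r'[a-z' + ul + r'0-9](?:[a-z' + ul + r'0-9-]{0,61}[a-z' + ul + r'0-9])?'
domain_re = r'(?:\.(?!-)[a-z' + ul + r'0-9-]{1,63}(?<!-))*'
tld_re = r'\.(?!-)(?:[a-z' + ul + r'-]{2,63}|xn--[a-z0-9]{1,59})(?<!-)\.?'
host_re = '(' + hostname_re + domain_re + tld_re + '|localhost)'

# URL_RE is a hypothetical name for the compiled pattern.
URL_RE = re.compile(
    r'^(?:http|ftp)s?://'      # http(s):// or ftp(s)://
    r'(?:\S+(?::\S*)?@)?'      # optional user:pass authentication
    r'(?:' + ipv4_re + '|' + ipv6_re + '|' + host_re + ')'
    r'(?::\d{2,5})?'           # optional port
    r'(?:[/?#][^\s]*)?'        # resource path
    r'\Z', re.IGNORECASE)


def is_valid_url(url):
    """Return True if url matches the pattern, False otherwise (hypothetical helper)."""
    return URL_RE.match(url) is not None


if __name__ == '__main__':
    for candidate in ('http://www.ietf.org/rfc/rfc3986.txt',
                      'http://localhost:8000/feed',
                      'http://127.0.0.1/',
                      ':::::',
                      'http://example',
                      'www.example.com'):
        print(candidate, is_valid_url(candidate))
```

Note that this pattern is deliberately much stricter than the RFC 3986 grammar: it only accepts http, https, ftp and ftps URLs with an IP address, localhost, or a dotted host name, so a string such as ":::::" is rejected here even though the generic RFC grammar allows it.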