Python regular expression again – match URL
Question:
I have this regular expression:
re.compile(r"((https?):((//)|(\\))+[\w\d:#@%/;$()~_?+-=\.&]*)", re.MULTILINE|re.UNICODE)
But that doesn’t include hashbangs (#!). What do I need to change to get it working? I know I can add ! to the group with #@%, etc., but that will select something like:
Check this out: http://example.com/something/!!!
And I want to avoid that.
Answers:
Don’t try to write your own regular expression for matching URLs. Use one from someone who has already solved these problems, like this one.
I’ll admit that I’m a little bit worried about an application that requires a regex like that to match URLs. That said, this seems to work for me:
((https?):((//)|(\\))+([\w\d:#@%/;$()~_?+-=\.&](#!)?)*)
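A quick way to check that behaviour is the sketch below. It assumes the \w and \d escapes belong inside the character class (the backslashes may have been lost when the pattern was posted), and find_url is a hypothetical helper name, not something from the question.

```python
import re

# Assumption: "\w\d" is the intended start of the character class.
# The optional "(#!)?" after each allowed character lets a hashbang
# through while a bare "!" still ends the match.
URL_RE = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?+-=\.&](#!)?)*)",
    re.MULTILINE | re.UNICODE,
)


def find_url(text):
    """Return the first URL-like match in text, or None."""
    match = URL_RE.search(text)
    return match.group(0) if match else None


print(find_url("Check this out: http://example.com/something/!!!"))
print(find_url("See http://twitter.com/#!/someuser now"))
```

The first call should stop before the exclamation marks, and the second should keep the #! inside the URL.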
This is a common problem. Use the standard library.
For Python, use urlparse (in Python 3 it lives in urllib.parse).
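As a rough illustration, here is how that might look in Python 3. Note that urlparse splits a string you already believe is a URL; it does not find URLs inside free text. The helper name looks_like_http_url and the sample URL are assumptions made for this sketch.

```python
from urllib.parse import urlparse


def looks_like_http_url(candidate):
    """Loose check: an http(s) scheme plus a non-empty host part."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# The hashbang ends up in the fragment, so no special handling is needed.
parts = urlparse("http://example.com/#!/user/42")
print(parts.scheme, parts.netloc, parts.path, parts.fragment)

print(looks_like_http_url("http://example.com/#!/user/42"))
print(looks_like_http_url("abc.com"))
```

A bare host such as abc.com has no scheme, so this check rejects it; whether that is what you want depends on the application.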
A fully correct pattern could be very long, but in practice mine works pretty well. Please try this one:
((http|https)://)?[a-zA-Z0-9./?:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9.&/?:@\-_=#])*
It matches all of the examples below:
http://wwww.stackoverflow.com
abc.com
http://test.test-75.1474.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
[email protected]
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:[email protected]/etcetc
(www.itmag.com)
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with
www/[email protected]
[email protected].
[email protected]
[email protected]
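To try the pattern against a few of the samples above, something like the following sketch could be used. It assumes the dot before the top-level domain and the hyphens inside the character classes are escaped; unescaped, @-_ is read as a character range that does not contain "-". The helper name matched_text is made up for this example.

```python
import re

# Assumption: "\." and "\-" are the intended escapes in the pattern.
PATTERN = re.compile(
    r"((http|https)://)?[a-zA-Z0-9./?:@\-_=#]+\.([a-zA-Z]){2,6}"
    r"([a-zA-Z0-9.&/?:@\-_=#])*"
)

SAMPLES = [
    "http://wwww.stackoverflow.com",
    "abc.com",
    "http://test.test-75.1474.stackoverflow.com/",
    "example.com/etcetc?query=aasd&dest=asds",
    "(www.itmag.com)",
]


def matched_text(sample):
    """Return the part of sample that the pattern matches, or None."""
    match = PATTERN.search(sample)
    return match.group(0) if match else None


for sample in SAMPLES:
    print(sample, "->", matched_text(sample))
```

Because it uses search, the parenthesised sample matches only the part between the brackets. The pattern is also very permissive, so it accepts plenty of strings that are not URLs.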
Based on this link, we can use the library validators.
For example:
import validators

valid = validators.url('https://codespeedy.com/')
if valid:
    print("URL is valid")
else:
    print("Invalid URL")
This is the most complete pattern I use:
URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\?)[A-Za-z0-9%-_&=]*'
I use this to search text for all HTTP and HTTPS URLs. It works like a charm.
URL_PATTERN = r"http[s]*\S+"
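A pattern built on \S+ grabs everything up to the next whitespace, including trailing punctuation such as the !!! in the question. One possible workaround, sketched below, is to strip trailing punctuation after matching. The slightly tighter pattern https?://\S+, the set of stripped characters, and the name extract_urls are all assumptions for this example.

```python
import re

# Assumed variant of the pattern above: exactly "http" or "https",
# followed by "://" and any run of non-whitespace characters.
URL_PATTERN = r"https?://\S+"

# Punctuation that is more likely to end a sentence than a URL.
TRAILING = ".,;:!?)\"'"


def extract_urls(text):
    """Find URL-like runs in text and trim trailing punctuation."""
    return [found.rstrip(TRAILING) for found in re.findall(URL_PATTERN, text)]


text = "Check this out: http://example.com/something/!!! and https://example.org/#!/a."
print(extract_urls(text))
```

A hashbang in the middle of a URL survives, but a URL that ends exactly in #! would lose its final "!", so this is a trade-off rather than a complete fix.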