Parsing hostname and port from string or url
Question:
I can be given a string in any of these formats:
- URL: e.g. http://www.acme.com:456
- string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and, if present, the port. If the port is not present, I would like it to default to 80.
I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
Answers:
I’m not that familiar with urlparse, but using regex you’d do something like:
import re
p = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p, 'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match ‘www.abc.com 123’
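To get the default of 80 the question asks for, the pattern above could be wrapped in a small helper. This is only a sketch: the function name host_and_port and the default_port parameter are made up for illustration, and the pattern is as loose as the original (it does no validation of the host).

```python
import re

# Same pattern as in the answer above, compiled once.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')

def host_and_port(text, default_port=80):
    """Return (host, port) from a URL or a 'host[:| ]port' string.

    Hypothetical helper: falls back to default_port when no digits
    follow the host.
    """
    m = PATTERN.search(text)
    if m is None:
        raise ValueError('no host found in %r' % text)
    port = m.group('port')
    # An empty port group means the input carried no port.
    return m.group('host'), int(port) if port else default_port

for s in ('http://www.acme.com:456', 'www.acme.com:456',
          'www.acme.com 456', 'www.acme.com'):
    print(s, '->', host_and_port(s))
```

The port comes back as an int, so callers do not have to special-case the empty string.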
The reason it fails for:
www.acme.com 456
is because it is not a valid URI. Why don’t you just:
- Replace the space with a :
- Parse the resulting string by using the standard urlparse method
Try to use standard functionality as much as possible, especially when parsing well-known formats such as URIs.
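The two steps above might look roughly like this on Python 3, where urlparse lives in urllib.parse. Two things here are assumptions that go beyond the answer: the helper name parse_host_port is made up, and a leading // is added when the string has no scheme, so that the parser treats host:port as the network location rather than as a scheme and path (the problem described in the question).

```python
from urllib.parse import urlsplit

def parse_host_port(text, default_port=80):
    """Hypothetical helper: normalise the input, then let urllib parse it."""
    # Step 1: replace the whitespace between host and port with a colon.
    text = ':'.join(text.split())
    # Assumption beyond the answer: without a scheme, prefix '//' so the
    # host:port part is parsed as the netloc.
    if '://' not in text:
        text = '//' + text
    # Step 2: parse with the standard library.
    parts = urlsplit(text)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

for s in ('http://www.acme.com:456', 'www.acme.com:456',
          'www.acme.com 456', 'www.acme.com'):
    print(s, '->', parse_host_port(s))
```

Note that the hostname attribute is returned in lower case, and a non-numeric port raises ValueError when the port attribute is read.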
You can use urlparse to get hostname from URL string:
from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse
print(urlparse("http://www.website.com/abc/xyz.html").hostname) # prints www.website.com
>>> from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
Method using urllib –
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output –
ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')
Reference – https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python