How to extract an IP address from an HTML string?
Question:
I want to extract an IP address from a string (actually a one-line HTML) using Python.
>>> s = "<html><head><title>Current IP Check</title></head><body>Current IP Address: 165.91.15.131</body></html>"
— ‘165.91.15.131’ is what I want!
I tried using regular expressions, but so far I can only get to the first number.
>>> import re
>>> ip = re.findall( r'([0-9]+)(?:\.[0-9]+){3}', s )
>>> ip
['165']
But I don’t have a firm grasp of regular expressions; the above code was found elsewhere on the web and modified.
Answers:
Remove your capturing group:
ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', s )
Result:
['165.91.15.131']
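The reason this works: when a pattern contains a capturing group, re.findall returns the text of that group instead of the whole match. A minimal sketch of the difference, using a shortened version of the string from the question (the variable names are only illustrative):

```python
import re

s = "<body>Current IP Address: 165.91.15.131</body>"

# One capturing group: findall returns only what the group matched.
with_group = re.findall(r'([0-9]+)(?:\.[0-9]+){3}', s)

# No capturing group: findall returns the whole match.
whole_match = re.findall(r'[0-9]+(?:\.[0-9]+){3}', s)

# Alternative: keep the group and read the full match via finditer.
via_finditer = [m.group(0) for m in re.finditer(r'([0-9]+)(?:\.[0-9]+){3}', s)]

print(with_group)    # ['165']
print(whole_match)   # ['165.91.15.131']
print(via_finditer)  # ['165.91.15.131']
```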
Notes:
- If you are parsing HTML it might be a good idea to look at BeautifulSoup.
- Your regular expression matches some invalid IP addresses such as 0.00.999.9999. This isn’t necessarily a problem, but you should be aware of it and possibly handle this situation. You could change the + to {1,3} for a partial fix without making the regular expression overly complex.
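One possible way to handle the invalid matches mentioned above is to collect candidates with the {1,3} pattern and then let the standard-library ipaddress module decide which ones are real addresses. This is only a sketch under that assumption; find_ipv4 is a hypothetical helper name, and the \b anchors are an addition to the pattern from the answer. Be aware that recent Python releases also reject octets with leading zeros in ipaddress.

```python
import ipaddress
import re

# Candidate pattern: four dot-separated runs of 1-3 digits.
CANDIDATE = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')

def find_ipv4(text):
    """Return the substrings of text that parse as IPv4 addresses."""
    found = []
    for candidate in CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. an octet above 255
        found.append(candidate)
    return found

print(find_ipv4("Current IP Address: 165.91.15.131, bogus: 300.1.2.3"))
# ['165.91.15.131']
```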
You can use the following regex to capture only valid IP addresses:
re.findall(r'\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', s)
Because each octet is a capturing group, it returns one tuple of octets per address:
[('165', '91', '15', '131')]
import re
ipPattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
findIP = re.findall(ipPattern, s)
findIP contains ['165.91.15.131']
This is how I’ve done it. I think it’s quite clean:
import re
import urllib2  # Python 2 module

def getIP():
    ip_checker_url = "http://checkip.dyndns.org/"
    address_regexp = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    response = urllib2.urlopen(ip_checker_url).read()
    # search() returns a match object, or None if nothing matched
    result = address_regexp.search(response)
    if result:
        return result.group()
    else:
        return None
getIP() returns the IP as a string, or None.
You can substitute another regular expression for address_regexp if you prefer more accurate parsing, or change the web service provider.
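On Python 3, urllib2 no longer exists; a rough equivalent is sketched below. It assumes the service still returns a page containing the address, and it splits the parsing into a hypothetical extract_ip helper so the regex part can be tried without network access. The names get_ip and extract_ip and the timeout value are illustrative choices, not part of the answer above.

```python
import re
from urllib.request import urlopen

ADDRESS_REGEXP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def extract_ip(html):
    """Return the first IP-like substring in html, or None."""
    result = ADDRESS_REGEXP.search(html)
    return result.group() if result else None

def get_ip(ip_checker_url="http://checkip.dyndns.org/"):
    """Fetch the checker page and return the address it reports, or None."""
    with urlopen(ip_checker_url, timeout=10) as response:
        # urlopen yields bytes on Python 3, so decode before searching.
        html = response.read().decode("utf-8", errors="replace")
    return extract_ip(html)

sample = "<body>Current IP Address: 165.91.15.131</body>"
print(extract_ip(sample))  # 165.91.15.131
```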
The easiest way to find the IP address in the log:
s = "<html><head><title>Current IP Check</title></head><body>Current IP Address: 165.91.15.131</body></html>"
info = re.findall(r'[\d.-]+', s)
In [42]: info
Out[42]: ['165.91.15.131']
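This pattern works here because the sample string contains no other digits, dots, or hyphens. A small check on a different, made-up log line (log_line is purely hypothetical) shows what else the character class would pick up:

```python
import re

s = "<html><head><title>Current IP Check</title></head><body>Current IP Address: 165.91.15.131</body></html>"
log_line = "2024-01-05 12:30 host-1 connected from 165.91.15.131."

# The class matches any run of digits, dots, and hyphens, not just addresses.
html_hits = re.findall(r'[\d.-]+', s)
log_hits = re.findall(r'[\d.-]+', log_line)

print(html_hits)  # ['165.91.15.131']
print(log_hits)   # ['2024-01-05', '12', '30', '-1', '165.91.15.131.']
```

So it is convenient for input as clean as the example, but on noisier text a pattern with explicit octets is probably safer.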
You can use the following regex to extract valid IPs while avoiding these errors:
1. Some patterns detect 123.456.789.111 as a valid IP.
2. Some don’t detect 127.0.00.1 as a valid IP.
3. Some don’t detect IPs that start with zero, like 08.8.8.8.
So here is a regex that handles all of the above conditions.
Note: I have extracted more than 2 million IPs without any problem using this regex.
(?:(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)\.){3}(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)
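To try the pattern against the three cases listed above, something like the following can be used. It builds the same regex from a repeated OCTET piece purely for readability; the names OCTET, IP_PATTERN, and results are illustrative.

```python
import re

# One octet, exactly as in the regex above.
OCTET = r'(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)'
IP_PATTERN = re.compile(r'(?:' + OCTET + r'\.){3}' + OCTET)

cases = ["123.456.789.111", "127.0.00.1", "08.8.8.8", "165.91.15.131"]
results = {text: IP_PATTERN.findall(text) for text in cases}

for text, hits in results.items():
    print(text, hits)
# 123.456.789.111 []
# 127.0.00.1 ['127.0.00.1']
# 08.8.8.8 ['08.8.8.8']
# 165.91.15.131 ['165.91.15.131']
```

One caveat worth checking on your own data: the pattern has no boundary anchors, so on input such as 1.2.3.456 it returns the partial match 1.2.3.45 rather than nothing.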