How to use regex to parse a number from HTML?
Question:
I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
Answers:
Given s = "Your number is <b>123</b>"
then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for one or more consecutive digits in your string.
Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is not None, because calling m.group() on None raises an AttributeError.
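The check on the return value of re.search() could look like the following minimal sketch. The helper name first_number is hypothetical and only used for illustration.

```python
import re

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    # re.search returns None when nothing matches, so test m before calling group()
    return m.group() if m else None

print(first_number("Your number is <b>123</b>"))  # 123
print(first_number("no digits here"))             # None
```

Returning None rather than raising leaves it to the caller to decide what a missing number means.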
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup – it’s meant for that and much more. The whole idea with BeautifulSoup is to avoid “manual” parsing using string ops or regular expressions.
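To give a rough idea of what parser-based extraction looks like without installing anything, here is a sketch that uses the standard library's html.parser module as a stand-in for BeautifulSoup. It is not BeautifulSoup's API, the class name FirstBoldText is hypothetical, and it picks up the first bold element anywhere in the input rather than the one after a particular phrase.

```python
from html.parser import HTMLParser

class FirstBoldText(HTMLParser):
    """Remember the text inside the first <b> element that is seen."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no bold text has been found yet
        if tag == "b" and self.result is None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold and self.result is None:
            self.result = data

parser = FirstBoldText()
parser.feed("Your number is <b>123</b>")
parser.close()
print(parser.result)  # 123
```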
import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b> fdjsk")
if m:
    print(m.group(1))
import re
x = 'Your number is <b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
This searches for the number that follows the 'Your number is' string. group(1) returns only the digits, whereas group(0) would return the whole match, '<b>123</b>'.
import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))
val = "Your number is <b>123</b>"
Option 1:
m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)
m.group(2)
Option 2:
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)
The simplest way is to just extract the digits:
re.search(r"\d+", text)
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
    print(found.group(1))
Here (\d+) is the capturing group. Since there is only one group, group(1) is used. When there are several groups, pass the index of the group you want to group().
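As a small illustration of group indexing, the sketch below adds a second capturing group to the pattern. The extra (\w+) group is an assumption made purely for the example. Group 0 is always the whole match, and the capturing groups are numbered from 1 in the order of their opening parentheses.

```python
import re

m = re.search(r"(\w+) number is <b>(\d+)</b>", "Your number is <b>123</b>")
if m:
    print(m.group(0))  # whole match: Your number is <b>123</b>
    print(m.group(1))  # first group: Your
    print(m.group(2))  # second group: 123
    print(m.groups())  # all groups as a tuple: ('Your', '123')
```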
You can use the following example to solve your problem:
import re
text = 'Your number is <b>123</b>'
search = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched Number:", search.group(0))
print("Starting Index Of Digit:", search.start())
print("Ending Index Of Digit:", search.end())
To extract the matches as a Python list, you can use findall:
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern, string)
['123']
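findall becomes more useful when the input contains several numbers. The sketch below uses a made-up input string (an assumption, not from the question) to show that a pattern with one capturing group makes findall return just the captured text, which restricts the result to numbers inside bold tags.

```python
import re

html = "Your number is <b>123</b>, mine is <b>456</b> and 789 is plain"

# No groups in the pattern: every run of digits is returned
all_numbers = re.findall(r"\d+", html)

# One capturing group: only the captured digits inside <b>...</b> are returned
bold_numbers = re.findall(r"<b>(\d+)</b>", html)

print(all_numbers)   # ['123', '456', '789']
print(bold_numbers)  # ['123', '456']
```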
import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)