Python – Using regex to find multiple matches and print them out
Question:
I need to find content of forms from HTML source file, I did some searching and found very good method to do that, but the problem is that it prints out only first found, how can I loop through it and output all form contents, not just first one?
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print matchObj.group(1)
# Output: Form 1
# I need it to output every form content he found, not just first one...
Answers:
Using regexes for this purpose is the wrong approach. Since you are using python you have a really awesome library available to extract parts from HTML documents: BeautifulSoup.
Do not use regular expressions to parse HTML.
But if you ever need to find all regexp matches in a string, use the findall
function.
import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)
# Output: ['Form 1', 'Form 2']
Instead of using re.search
use re.findall
it will return you all matches in a List
. Or you could also use re.finditer
(which i like most to use) it will return an Iterator Object
and you can just use it to iterate over all found matches.
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
for match in re.finditer('<form>(.*?)</form>', line, re.S):
print match.group(1)
I need to find content of forms from HTML source file, I did some searching and found very good method to do that, but the problem is that it prints out only first found, how can I loop through it and output all form contents, not just first one?
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print matchObj.group(1)
# Output: Form 1
# I need it to output every form content he found, not just first one...
Using regexes for this purpose is the wrong approach. Since you are using python you have a really awesome library available to extract parts from HTML documents: BeautifulSoup.
Do not use regular expressions to parse HTML.
But if you ever need to find all regexp matches in a string, use the findall
function.
import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)
# Output: ['Form 1', 'Form 2']
Instead of using re.search
use re.findall
it will return you all matches in a List
. Or you could also use re.finditer
(which i like most to use) it will return an Iterator Object
and you can just use it to iterate over all found matches.
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
for match in re.finditer('<form>(.*?)</form>', line, re.S):
print match.group(1)