Matching partial ids in BeautifulSoup
Question:
I’m using BeautifulSoup. I have to find any reference to the <div> tags with id like: post-#.
For example:
<div id="post-45">...</div>
<div id="post-334">...</div>
I have tried:
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
print soupHandler.findAll('div', id='post-*')
How can I filter this?
Answers:
You can pass a function to findAll:
>>> print soupHandler.findAll('div', id=lambda x: x and x.startswith('post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
Or a regular expression:
>>> print soupHandler.findAll('div', id=re.compile('^post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
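For reference, here is a minimal self-contained sketch of both approaches. It assumes Python 3 and the bs4 package with the built-in "html.parser"; the extra <div> tags and the variable names are made up for illustration and are not part of the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: two matching divs, one unrelated id, one div without an id (illustrative only)
html = (
    '<div id="post-45">...</div> <div id="post-334">...</div> '
    '<div id="sidebar">...</div> <div>no id</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Function filter: the value may be None when a tag has no id, hence the guard
by_function = soup.find_all(
    "div", id=lambda value: value is not None and value.startswith("post-")
)

# Regex filter: keep ids that start with "post-"
by_regex = soup.find_all("div", id=re.compile(r"^post-"))

function_ids = [div["id"] for div in by_function]
regex_ids = [div["id"] for div in by_regex]
print(function_ids)
print(regex_ids)
```

Both filters should select the same two tags here. The function form is handy when the condition is not easy to write as a pattern.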
soupHandler.findAll('div', id=re.compile(r"^post-\d+$"))
looks right to me. (Note that "^post-$" on its own would only match an id that is exactly "post-".)
Since the asker wants to match “post-#somenumber#”, it’s better to be precise with
import re
[...]
soupHandler.findAll('div', id=re.compile(r"^post-\d+"))
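To see what the digit requirement changes, here is a small sketch comparing three patterns. It assumes Python 3 and bs4; the ids "post-footer" and "post-12-comments" and the helper ids_matching are hypothetical, added only to show the difference.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup: two numeric post ids plus two look-alikes
html = (
    '<div id="post-45">...</div>'
    '<div id="post-334">...</div>'
    '<div id="post-footer">...</div>'
    '<div id="post-12-comments">...</div>'
)
soup = BeautifulSoup(html, "html.parser")


def ids_matching(pattern):
    """Return the ids of all <div> tags whose id matches the given regex."""
    return [div["id"] for div in soup.find_all("div", id=re.compile(pattern))]


prefix_only = ids_matching(r"^post-")       # any id starting with "post-"
digits_start = ids_matching(r"^post-\d+")   # "post-" followed by at least one digit
digits_only = ids_matching(r"^post-\d+$")   # "post-" followed by digits and nothing else

print(prefix_only)
print(digits_start)
print(digits_only)
```

If ids such as "post-12-comments" can appear on the page, adding the trailing $ is probably what you want.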
This works for me:
from bs4 import BeautifulSoup
import re
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
for match in soupHandler.find_all('div', id=re.compile("post-")):
    print(match.get('id'))
>>>
post-45
post-334
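One caveat with the unanchored pattern "post-": Beautiful Soup applies a compiled regex with its search() method, so the pattern can match anywhere in the id. A short sketch, assuming Python 3 and bs4, with a made-up "repost-7" id to show the effect:

```python
import re
from bs4 import BeautifulSoup

# "repost-7" is a hypothetical id that contains "post-" but does not start with it
html = '<div id="post-45"></div><div id="repost-7"></div><div id="post-334"></div>'
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches "post-" anywhere inside the id
unanchored = [d["id"] for d in soup.find_all("div", id=re.compile(r"post-"))]

# Anchored with ^: the id has to start with "post-"
anchored = [d["id"] for d in soup.find_all("div", id=re.compile(r"^post-"))]

print(unanchored)
print(anchored)
```

For the two ids in the question both patterns give the same result, so the anchor only matters when other ids on the page contain "post-" in the middle.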