Test if a child tag exists in BeautifulSoup
Question:
I have XML files with a defined structure but a varying number of tags, like
file1.xml:
<document>
  <subDoc>
    <id>1</id>
    <myId>1</myId>
  </subDoc>
</document>
file2.xml:
<document>
  <subDoc>
    <id>2</id>
  </subDoc>
</document>
Now I would like to check whether the tag myId exists. So I did the following:
data = open("file1.xml",'r').read()
xml = BeautifulSoup(data)
hasAttrBs = xml.document.subdoc.has_attr('myID')
hasAttrPy = hasattr(xml.document.subdoc,'myID')
hasType = type(xml.document.subdoc.myid)
The result is for
file1.xml:
hasAttrBs -> False
hasAttrPy -> True
hasType -> <class 'bs4.element.Tag'>
file2.xml:
hasAttrBs -> False
hasAttrPy -> True
hasType -> <type 'NoneType'>
Okay, <myId> is not an attribute of <subdoc>.
But how can I test whether a sub-tag exists?
//Edit: By the way: I don't really want to iterate through the whole subdoc, because that will be very slow. I hope to find a way to address that element directly.
Answers:
You can handle it like this:
def has_child(parent, name):
    # HTML parsers lowercase tag names, so pass 'myid' rather than 'myId'.
    for child in parent.children:
        if child.name == name:
            return True
    return False
The simplest way to check whether a child tag exists is
childTag = xml.find('childTag')
if childTag is not None:
    pass  # do stuff
More specifically to OP’s question:
If you don't know the structure of the XML doc, you can use the .find() method of the soup. Something like this:
with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")
# HTML parsers lowercase tag names, so search for "myid", not "myId".
hasAttrBs = xml.find("myid")
hasAttrBs2 = xml2.find("myid")
If you do know the structure, you can get the desired element by accessing the tag name as an attribute, like this: xml.document.subdoc.myid. So the whole thing would go something like this:
with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")
hasAttrBs = xml.document.subdoc.myid
hasAttrBs2 = xml2.document.subdoc.myid
print(hasAttrBs)
print(hasAttrBs2)
Prints
<myid>1</myid>
None
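One caveat with the lookups above: when the file is parsed with an HTML parser, tag names are stored in lowercase, so the lookup has to use myid rather than myId. Here is a minimal sketch of that behaviour, assuming the built-in html.parser and the two sample documents inlined as strings (FILE1, FILE2 and has_my_id are names made up for illustration):

```python
from bs4 import BeautifulSoup

# The two sample documents from the question, inlined so the sketch is self-contained.
FILE1 = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
FILE2 = "<document><subDoc><id>2</id></subDoc></document>"

def has_my_id(markup):
    """Return True if <subDoc> contains a <myId> tag (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser stores tag names in lowercase, hence "subdoc" and "myid".
    subdoc = soup.find("subdoc")
    return subdoc is not None and subdoc.find("myid") is not None

print(has_my_id(FILE1))
print(has_my_id(FILE2))

# The mixed-case name does not match under an HTML parser.
mixed_case = BeautifulSoup(FILE1, "html.parser").find("myId")
print(mixed_case)
```

If case-sensitive names matter, the "xml" parser (which requires lxml) keeps myId as written.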
if tag.find('child_tag_name'):
Here is an example that checks whether an h2 tag exists on an Instagram page. Hope you find it useful:
import requests
from bs4 import BeautifulSoup
instagram_url = 'https://www.instagram.com/p/BHijrYFgX2v/?taken-by=findingmero'
html_source = requests.get(instagram_url).text
soup = BeautifulSoup(html_source, "lxml")
if not soup.find('h2'):
    print("didn't find h2")
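You can do it with if tag.myid: (use the tag name as the parser stores it: HTML parsers lowercase names to myid, while the "xml" parser keeps myId).
If you want to check that myid is a direct child, not a child of a child, use if tag.find("myid", recursive=False):
If you want to check whether the tag has any child tag at all, use if tag.find(True): and negate it with not to test for no children.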
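To make the difference between these lookups concrete, here is a small sketch on made-up markup (the tag names outer, inner, deep and leaf are assumptions chosen for illustration, parsed with the built-in html.parser):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <deep> is a grandchild of <outer>, <leaf> is empty.
markup = "<outer><inner><deep>x</deep></inner><leaf></leaf></outer>"
outer = BeautifulSoup(markup, "html.parser").outer

anywhere = outer.find("deep")                      # searches all descendants
direct_only = outer.find("deep", recursive=False)  # searches direct children only
leaf_child = outer.leaf.find(True)                 # first child tag of <leaf>, if any

print(anywhere is not None)
print(direct_only is not None)
print(leaf_child is not None)
```

With recursive=False only the direct children are inspected, which also avoids walking the whole subtree.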
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, 'html.parser')
testNode = list(soup.children)[1]
def hasChild(node):
    # Only Tag objects can contain children; text nodes cannot.
    return isinstance(node, Tag) and len(node.contents) > 0
if hasChild(testNode):
    firstChild = list(testNode.children)[0]
    if hasChild(firstChild):
        print('I found Grand Child ')
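Keep in mind that .children yields text nodes as well as tags, so whitespace between elements counts as a child. A possible way to look at child tags only is sketched below, assuming a tiny inline document and a hypothetical helper named child_tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def child_tags(node):
    """Return the direct children of node that are tags (hypothetical helper)."""
    if not isinstance(node, Tag):
        return []  # text nodes cannot contain anything
    return [child for child in node.children if isinstance(child, Tag)]

soup = BeautifulSoup("<div>\n<p>hi</p>\n</div>", "html.parser")
div = soup.div

print(len(list(div.children)))            # also counts the newline text nodes
print([t.name for t in child_tags(div)])  # names of the child tags only
print(child_tags(div.p))                  # <p> holds text only, so no child tags
```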
If you are using a CSS selector:
def select_or_none(soup_elm):
    content = soup_elm.select('.css_selector')
    if len(content) == 0:
        return None
    return content
You could also try it this way:
import requests
from bs4 import BeautifulSoup
response = requests.get("Your URL here")
soup = BeautifulSoup(response.text, 'lxml')
result = soup.select_one('CSS_SELECTOR_HERE')  # returns the first match, or None
print(result)
Note that CSS selectors in bs4 are a little different from other selector methods. See the Beautiful Soup documentation for how to use CSS selectors.
soup.select returns all matching elements and works for elements with attributes as well.
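Applied to the question, a CSS child combinator can express "myid directly under subdoc" in a single call. This is only a sketch, assuming the sample document inlined as a lowercase string and the built-in html.parser:

```python
from bs4 import BeautifulSoup

markup = "<document><subdoc><id>1</id><myid>1</myid></subdoc></document>"
soup = BeautifulSoup(markup, "html.parser")

# "subdoc > myid" matches <myid> only when it is a direct child of <subdoc>.
direct = soup.select_one("subdoc > myid")
# "document > myid" finds nothing: <myid> is a grandchild of <document>.
skipped_level = soup.select_one("document > myid")
# select() always returns a list, which is empty when nothing matches.
all_ids = soup.select("subdoc > id")

print(direct)
print(skipped_level)
print(len(all_ids))
```

select_one returns None when nothing matches, so it can be tested the same way as find.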