Identify a unique tag using BeautifulSoup
Question:
BeautifulSoup treats two tags as identical if they both contain the exact same content, even when the two tags are not the same DOM node.
Example:
from bs4 import BeautifulSoup
x = '<div class="a"><span>hello</span></div><div class="b"><span>hello</span></div>'
page = BeautifulSoup(x, 'html.parser')
spans = page.select('span')
print(spans[0] == spans[1])  # prints True
The way I have managed to get around this is to account for their parents as well, e.g.:
spans = page.select('span')
print(spans[0] == spans[1] and list(spans[0].parents) == list(spans[1].parents))  # prints False
However, this method – when used on a normal HTML page with many nested DOM nodes – is often an order of magnitude slower than just comparing spans[0] to spans[1] without the parents.
My question is: is there a more efficient way to determine, via Beautiful Soup, whether two nodes are truly the same one?
Answers:
You can use id():
print(id(spans[0]) == id(spans[1]))
Prints:
False
Or use the is operator, which compares object identity directly:
print(spans[0] is spans[1])
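This also prints False.

As a minimal sketch of where the difference matters: list methods such as index() compare with ==, so they can return the wrong position when two tags have the same content. The helper index_of_node below is a hypothetical name, not part of Beautiful Soup; it just walks the list and compares with is.

```python
from bs4 import BeautifulSoup

html = '<div class="a"><span>hello</span></div><div class="b"><span>hello</span></div>'
page = BeautifulSoup(html, 'html.parser')
spans = page.select('span')
first, second = spans[0], spans[1]

# == compares name, attributes and contents; "is" compares object identity.
equal_by_content = first == second   # True: same tag name, attrs and text
same_node = first is second          # False: two distinct nodes in the tree


def index_of_node(nodes, target):
    """Hypothetical helper: position of `target` in `nodes`, matched by identity."""
    for i, node in enumerate(nodes):
        if node is target:
            return i
    raise ValueError("node not found")


# list.index() uses ==, so it stops at the first content-equal tag.
index_by_equality = spans.index(second)
index_by_identity = index_of_node(spans, second)

print(equal_by_content, same_node)
print(index_by_equality, index_by_identity)
```

The identity check is a single pointer comparison, so it avoids both the recursive content comparison done by == and the walk up the parent chain.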