How to dynamically find the nearest specific parent of a selected element?

Question:

I want to parse many HTML pages and remove a div that contains the text "Message", using Python and BeautifulSoup with html.parser. The div has no name or id, so I cannot select it directly. I can do this for one HTML page. The code below uses six .parent accesses because, in this page, there are five tags (p, i, b, span, a) between the div tag and the text "Message", and the sixth parent is the div itself. The code works fine for that one page.

soup = BeautifulSoup(html_page,"html.parser")
scores = soup.find_all(text=re.compile('Message'))
divs = [score.parent.parent.parent.parent.parent.parent for score in scores]
divs.decompose()

The problem is that the number of tags between the div and "Message" is not always 6. In some HTML pages it is 3, and in others it is 7.

So, is there a way to dynamically find the number of tags (n) between the text "Message" and the nearest div to its left, and then apply n+1 .parent accesses to score (in the code above), using Python and BeautifulSoup?

Asked By: Newbie

Answers:

Assuming, as your question describes, that there is no other <div> between the text and the <div> you want to remove, you could use .find_parent(), which returns the nearest matching ancestor:

soup.find(text=re.compile('Message')).find_parent('div').decompose()
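To illustrate what "nearest" means here, the following is a minimal sketch with two nested divs. The id attributes and the sample markup are made up purely so the result can be inspected (the divs in the question have no id), and string= is used as the current name of the text= argument in recent BeautifulSoup releases.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the ids exist only to make the result easy to check.
html = """
<div id="outer">
  <div id="inner"><p><b>Message</b></p></div>
  <p>keep me</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the text node, then walk up to the closest enclosing <div>.
node = soup.find(string=re.compile("Message"))
nearest = node.find_parent("div")

# Read the id before decomposing, since decompose() destroys the tag.
nearest_id = nearest["id"]
nearest.decompose()

# Only the inner div is gone; the outer div and its other content remain.
remaining_text = soup.get_text(strip=True)
```

Only the closest div is removed, regardless of how many p, i, b or span tags sit between it and the text.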

Be aware that if you use find_all(), you have to iterate over the ResultSet and call .find_parent() on each result:

for r in soup.find_all(text=re.compile('Message')):
    r.find_parent('div').decompose()

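The same applies to divs.decompose() in your example: divs is a list, which has no decompose() method, so you have to iterate over it and call decompose() on each element.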
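When running this over many pages, two edge cases may come up: a match that has no enclosing div at all (find_parent() then returns None), and several matches inside the same div. The sketch below is one possible way to handle both, not the only one. The helper name remove_message_divs and the sample markup are hypothetical; it collects the target divs first and only then modifies the tree.

```python
import re
from bs4 import BeautifulSoup

def remove_message_divs(soup, pattern="Message"):
    """Remove the nearest <div> around each text node matching `pattern`.

    Hypothetical helper; returns the number of <div> elements removed.
    """
    # Resolve every target before touching the tree, so that removing one
    # div cannot invalidate text nodes that have not been processed yet.
    targets = []
    seen = set()
    for node in soup.find_all(string=re.compile(pattern)):
        div = node.find_parent("div")
        if div is None or id(div) in seen:
            continue  # no enclosing div, or this div is already scheduled
        seen.add(id(div))
        targets.append(div)

    removed = 0
    for div in targets:
        # A div nested inside an already removed div has been destroyed too.
        if getattr(div, "decomposed", False):
            continue
        div.decompose()
        removed += 1
    return removed

# Made-up sample page covering the edge cases described above.
html = """
<p>Message outside any div</p>
<div><p><i>Message one</i> and <b>Message two</b></p></div>
<div><span>Message three</span></div>
<div><p>Unrelated</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
count = remove_message_divs(soup)
```

Here the two matches in the first div count as a single removal, and the paragraph outside any div is left untouched.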

Example

from bs4 import BeautifulSoup
import re
html='''
<div>
    <span>
        <i>
            <x>Message</x>
        </i>
    </span>
</div>
'''
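soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# returns the enclosing <div> tag, however many tags lie in between
soup.find(text=re.compile('Message')).find_parent('div')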
Answered By: HedgeHog