How to extract all <p> elements together with their corresponding <h2>?

Question:

I am trying to get all the <p> that come after <h2>.

I know how to do this in case I have only one <p> after <h2>, but not in case I have multiple <p>.

Here’s an example of the webpage:

<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....

I need to get all paragraphs in relation to their headings, e.g. Paragraphs 1 and 2 that are related to Heading Text1.

I'm using BeautifulSoup with Python. I have been trying (and googling) for days without success.

How can this be done?

Asked By: W01v3n

Answers:

This is how I would do it: get all the h2 and p tags, iterate through them in document order, remember the text of the last h2 seen, and attach each following paragraph to it.

from bs4 import BeautifulSoup

html = '''
<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''

soup = BeautifulSoup(html, 'html.parser')

dict_to_save = {}
header = None  # no heading seen yet

# find all the 'h2' and 'p' tags in document order
for tag in soup(['h2', 'p']):
    # an 'h2' tag becomes the current header
    if tag.name == 'h2':
        header = tag.text.strip()

    # a 'p' tag is added to the last header; paragraphs before the first 'h2' are skipped
    elif header is not None:
        dict_to_save.setdefault(header, []).append(tag.text.strip())

print(dict_to_save)

Output

{'Heading Text1': ['Paragraph1', 'Paragraph2'],
 'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
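
The loop above can also be wrapped in a reusable function. This is only a minimal sketch: the function name group_paragraphs_by_heading and the sample markup are hypothetical, and it assumes that paragraphs appearing before the first <h2> should be skipped and that a heading with no paragraphs should still appear with an empty list.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def group_paragraphs_by_heading(html):
    """Map each <h2> text to the list of <p> texts that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    groups = defaultdict(list)
    header = None  # no heading seen yet
    # find_all with a list returns matching tags in document order
    for tag in soup.find_all(["h2", "p"]):
        if tag.name == "h2":
            header = tag.get_text(strip=True)
            groups[header]  # touch the key so empty sections are kept
        elif header is not None:
            groups[header].append(tag.get_text(strip=True))
    return dict(groups)


# hypothetical sample: one leading paragraph and one empty section
sample = "<p>Intro</p><h2>A</h2><p>one</p><p>two</p><h2>B</h2>"
result = group_paragraphs_by_heading(sample)
print(result)
```

Note that two headings with identical text would share one key here, just as in the original loop.
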
Answered By: Hanna

You can do this with a dict and .find_previous(): iterate over all <p>, find the previous <h2> for each one and use its text as the key in your dict, then append the paragraph's text to that key's list:

d = {}
for p in soup.select('p'):
    h2 = p.find_previous('h2')
    # skip paragraphs that have no preceding <h2>
    if h2 is None:
        continue
    d.setdefault(h2.text, []).append(p.text)

Example

from bs4 import BeautifulSoup

html = '''
<p>Any Other Paragraph</p>
<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')

d = {}
for p in soup.select('p'):
    h2 = p.find_previous('h2')
    # skip paragraphs that have no preceding <h2>
    if h2 is None:
        continue
    d.setdefault(h2.text, []).append(p.text)
print(d)

Output

{'Heading Text1': ['Paragraph1', 'Paragraph2'],
 'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
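
One property of .find_previous() worth noting: it walks backward through the whole document in parse order, not only through siblings, so the same loop still works when headings and paragraphs sit inside different wrapper elements. A minimal sketch, assuming a hypothetical nested_html sample that is not taken from the question:

```python
from bs4 import BeautifulSoup

# hypothetical markup: headings and paragraphs are not siblings
nested_html = """
<div><h2>Heading A</h2></div>
<div><p>First</p><section><p>Second</p></section></div>
<h2>Heading B</h2>
<article><p>Third</p></article>
"""

soup = BeautifulSoup(nested_html, "html.parser")

grouped = {}
for p in soup.find_all("p"):
    # nearest <h2> that appears earlier in the document, at any nesting level
    heading = p.find_previous("h2")
    if heading is None:
        continue
    key = heading.get_text(strip=True)
    grouped.setdefault(key, []).append(p.get_text(strip=True))

print(grouped)
```
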
Answered By: HedgeHog

This is almost identical to the question posed yesterday. You can solve this in a few different ways. Here is how I would do it:

from bs4 import BeautifulSoup

html = """
<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>

<h2>Heading Text3</h2>

<p>Paragraph6</p>
<p>Paragraph7</p>
<p>Paragraph8</p>
<p>Paragraph9</p>
"""

soup = BeautifulSoup(html,"html.parser")
data_dict = {}
for item in soup.select("h2"):
    header = item.get_text(strip=True)
    content = []
    # walk the nodes that follow this heading until the next <h2>
    for sibling in item.next_siblings:
        if sibling.name == "h2":
            break
        content.extend(sibling.stripped_strings)
    data_dict[header] = content

print(data_dict)
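
If a section may contain other elements (lists, tables) and only the <p> text is wanted, the sibling walk can filter by tag name. A minimal sketch under that assumption: the sample markup is hypothetical, and .find_next_siblings() is swapped in for .next_siblings. This approach only sees elements that share the <h2>'s parent.

```python
from bs4 import BeautifulSoup

# hypothetical markup with a list between two paragraphs
html = (
    "<h2>Title</h2><p>Keep me</p><ul><li>Skip me</li></ul>"
    "<p>Keep me too</p><h2>Next</h2><p>Other</p>"
)

soup = BeautifulSoup(html, "html.parser")

sections = {}
for h2 in soup.find_all("h2"):
    paragraphs = []
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            break  # the next section starts here
        if sibling.name == "p":
            paragraphs.append(sibling.get_text(strip=True))
    sections[h2.get_text(strip=True)] = paragraphs

print(sections)
```
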
Answered By: MITHU