Get full text below h2 tag with BeautifulSoup

Question:

I am trying to obtain different characters’ descriptions and abilities for a dataset.
The problem I’ve encountered is that there is a span tag within the h2 tag, and in some cases a figure before the p tags. This is the format I’m facing:

<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px"> 
<p>...</p>
<p>...</p>
<p>...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>...</p>
<p>...</p>
<p>...</p>

I need to obtain the text in those paragraphs.

I have tried something like this but it obviously does not work.

import urllib.request
from bs4 import BeautifulSoup

fp = urllib.request.urlopen("https://jojo.fandom.com/es/wiki/Star_Platinum")
mybytes = fp.read()

html_doc = mybytes.decode("utf8")
fp.close()

soup = BeautifulSoup(html_doc, 'html.parser')

spans = soup.find_all('span', {"class": "mw-headline"})
for s in spans:
    print(s.nextSibling.getText)
Asked By: Iván Grzegorczyk

||

Answers:

For each <p>, you can search backwards for the nearest preceding <h2> with find_previous() and group the paragraph texts by heading in a dictionary:

from bs4 import BeautifulSoup


html_doc = '''
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px">
<p>T1 ...</p>
<p>T2 ...</p>
<p>T3 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T4 ...</p>
<p>T5 ...</p>
<p>T6 ...</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

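# Map each heading's text to the list of paragraph texts that follow it.
out = {}
for p in soup.select('p'):
    # find_previous() walks backwards in document order, so it returns the
    # closest <h2> that appears before this paragraph.
    previous_h2 = p.find_previous('h2')
    out.setdefault(previous_h2.text, []).append(p.text)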

print(out)

Prints:

{
    'Apariencia': ['T1 ...', 'T2 ...', 'T3 ...'], 
    'Personalidad': ['T4 ...', 'T5 ...', 'T6 ...']
}
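
A possible variation, not part of the snippet above: start from each mw-headline span, go up to its parent <h2>, and walk forward through the heading's siblings until the next <h2>. This also suggests why the attempt in the question fails: the span is the only child of its <h2>, so its next sibling is None, and getText is referenced without being called. The sketch assumes the paragraphs are siblings of the <h2> (so the <figure> is given a closing tag in the sample), and the function name paragraphs_by_heading is hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the <figure> is closed here
# (an assumption) so that the <p> tags stay siblings of the <h2> tags.
html_doc = '''
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px"></figure>
<p>T1 ...</p>
<p>T2 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T3 ...</p>
'''


def paragraphs_by_heading(html):
    """Return {heading text: [paragraph texts]} for each mw-headline section."""
    soup = BeautifulSoup(html, 'html.parser')
    sections = {}
    for span in soup.find_all('span', class_='mw-headline'):
        heading = span.find_parent('h2')
        if heading is None:
            continue  # headline span that is not inside an <h2>
        texts = []
        # With no arguments, find_next_siblings() yields the following tags.
        for sibling in heading.find_next_siblings():
            if sibling.name == 'h2':
                break  # the next section starts here
            if sibling.name == 'p':
                texts.append(sibling.get_text(strip=True))
        sections[span.get_text(strip=True)] = texts
    return sections


sections = paragraphs_by_heading(html_doc)
print(sections)
```

Unlike the find_previous() approach, this one skips paragraphs that appear before the first heading, and it only sees paragraphs that sit at the same nesting level as the <h2>.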
Answered By: Andrej Kesely