Get full text below h2 tag with BeautifulSoup
Question:
I am trying to collect the descriptions and abilities of different characters for a dataset.
The problem I've encountered is that there is a span tag inside the h2 tag and, in some cases, a figure before the p tags. This is the format I'm facing:
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px">
<p>...</p>
<p>...</p>
<p>...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>...</p>
<p>...</p>
<p>...</p>
I need to obtain the text in those paragraphs.
I have tried something like this but it obviously does not work.
import urllib.request
from bs4 import BeautifulSoup
fp = urllib.request.urlopen("https://jojo.fandom.com/es/wiki/Star_Platinum")
mybytes = fp.read()
html_doc = mybytes.decode("utf8")
fp.close()
soup = BeautifulSoup(html_doc, 'html.parser')
spans = soup.find_all('span', {"class": "mw-headline"})
for s in spans:
    print(s.nextSibling.getText)
Answers:
For each <p>, you can search backwards for the previous <h2> and store the result in a dictionary:
from bs4 import BeautifulSoup
html_doc = '''
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px">
<p>T1 ...</p>
<p>T2 ...</p>
<p>T3 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T4 ...</p>
<p>T5 ...</p>
<p>T6 ...</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
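out = {}
for p in soup.select('p'):
    # Find the closest <h2> that precedes this paragraph in document order
    previous_h2 = p.find_previous('h2')
    # Group the paragraph text under that heading's text
    out.setdefault(previous_h2.text, []).append(p.text)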
print(out)
Prints:
{
'Apariencia': ['T1 ...', 'T2 ...', 'T3 ...'],
'Personalidad': ['T4 ...', 'T5 ...', 'T6 ...']
}
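On a full page there may be paragraphs that appear before the first <h2>, in which case find_previous('h2') returns None and the loop above would raise an AttributeError. A minimal sketch of a guarded variant follows; the function name paragraphs_by_heading, the "(no heading)" fallback key, and the sample HTML are assumptions made for illustration, not part of the original answer or of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one paragraph before any heading, then two sections.
html_doc = '''
<p>Intro paragraph before any heading.</p>
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft" style="width: 100px"></figure>
<p>T1 ...</p>
<p>T2 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T3 ...</p>
'''


def paragraphs_by_heading(html, default_key="(no heading)"):
    """Group the text of every <p> under the text of the nearest preceding <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    out = {}
    for p in soup.find_all("p"):
        h2 = p.find_previous("h2")
        # Paragraphs with no preceding <h2> go under the fallback key
        key = h2.get_text(strip=True) if h2 is not None else default_key
        out.setdefault(key, []).append(p.get_text(strip=True))
    return out


sections = paragraphs_by_heading(html_doc)
print(sections)
```

Using get_text(strip=True) also trims surrounding whitespace, which tends to be useful when the text is going into a dataset.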