How to get the value inside an h3 tag with BeautifulSoup in Python?
Question:
I'm trying to get the value inside an h3 tag, but I'm stuck on a problem I can't figure out. This is the data I want to get: the word Zahlen, which sits between the two span elements inside the h3 tag. I couldn't figure out how to do this in Python with BeautifulSoup.
<h3>
<span class="ez-toc-section" id="Zahlen"></span>
Zahlen
<span class="ez-toc-section-end"></span>
</h3>
I'm trying to create a dictionary dataset. It will be a German dictionary with the words separated into categories. As mentioned above, Zahlen is an identifier (a category name), and each identifier will have its own words. Other h3 tags will have other words. This is the code I have written so far:
import json
import requests
from bs4 import BeautifulSoup

# url is the address of the page being scraped
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')
div = soup.find('div', class_='entry-content clearfix')

words = []  # named `words` rather than `list` to avoid shadowing the built-in
for word in div.find_all('table', class_='table table-bordered'):
    for word1 in word.find_all('tbody'):
        rows = word1.find_all('tr')
        for row in rows:
            each_word = row.find_all('td')
            case = {
                "index": each_word[0].string,
                "word": each_word[1].string,
                "meaning": each_word[2].string
            }
            words.append(case)

with open('DictionaryWords.json', 'w', encoding='utf-8') as f:
    json.dump(words, f, ensure_ascii=False, indent=4)
An example of the result:
[
    {
        "index": "1.",
        "word": "Hallo",
        "meaning": "Merhaba"
    },
    {
        "index": "2.",
        "word": "Herzlich willkommen",
        "meaning": "Hoş geldiniz"
    },
    {
        "index": "3.",
        "word": "Auf Wiedersehen",
        "meaning": "Hoşça kalın"
    },
    {
        "index": "4.",
        "word": "Guten Morgen",
        "meaning": "Günaydın"
    },
    {
        "index": "5.",
        "word": "Haben Sie einen guten Tag",
        "meaning": "İyi günler"
    }
]
This is the data that I want:
[
    {
        "zahlen": [
            {
                "index": "1.",
                "word": "Hallo",
                "meaning": "Merhaba"
            },
            {
                "index": "2.",
                "word": "Herzlich willkommen",
                "meaning": "Hoş geldiniz"
            },
            {
                "index": "3.",
                "word": "Auf Wiedersehen",
                "meaning": "Hoşça kalın"
            }
        ]
    }
]
Answers:
You could find the titles with div.find_all('h3') and then zip them with the list of tables. Your loop for getting the tables didn't work for me, though, so I rewrote your code using pandas' read_html, which handles the table extraction (it uses lxml or BeautifulSoup under the hood):
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://almancakonulari.com/a1-seviye-almanca-kelimeler/#gsc.tab=0'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# restrict search to div with class entry-content clearfix
div = soup.find('div', attrs={'class': 'entry-content clearfix'})

# get all h3
h3 = div.find_all('h3')
h3 = [' '.join(i.get_text().strip().split()) for i in h3]  # clean text

# get all tables with pandas
df = pd.read_html(page.content)
df = df[-len(h3):]  # only keep relevant tables, i.e. the last n tables where n == len(h3)

# rename all columns of the tables
for i in df:
    i.columns = ['index', 'word', 'meaning']

# create output dict
output_dict = {h3[n]: i.to_dict(orient='records') for n, i in enumerate(df)}
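For the narrower question of reading the text out of the h3 itself, a minimal sketch without pandas might look like the following. It assumes that every h3 is followed by its own table, which may not hold for the real page: if a heading has no table before the next heading, find_next would pick up the next heading's table. SAMPLE_HTML, the helper name tables_by_heading and the sample rows are made up for illustration and are not taken from the site. Keys are lowercased only to match the desired output in the question.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question;
# the rows are illustrative sample data, not scraped content.
SAMPLE_HTML = """
<div class="entry-content clearfix">
  <h3>
    <span class="ez-toc-section" id="Zahlen"></span>
    Zahlen
    <span class="ez-toc-section-end"></span>
  </h3>
  <table class="table table-bordered">
    <tbody>
      <tr><td>1.</td><td>eins</td><td>bir</td></tr>
      <tr><td>2.</td><td>zwei</td><td>iki</td></tr>
    </tbody>
  </table>
  <h3>
    <span class="ez-toc-section" id="Tage"></span>
    Tage
    <span class="ez-toc-section-end"></span>
  </h3>
  <table class="table table-bordered">
    <tbody>
      <tr><td>1.</td><td>Montag</td><td>Pazartesi</td></tr>
    </tbody>
  </table>
</div>
"""


def tables_by_heading(html):
    """Map each h3 title (lowercased) to the rows of the table after it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for h3 in soup.find_all("h3"):
        # get_text(strip=True) skips the empty <span> markers and the
        # surrounding whitespace, leaving only the heading text.
        title = h3.get_text(strip=True)
        # Assumption: the next <table> in document order belongs to this h3.
        table = h3.find_next("table")
        if table is None:
            continue
        entries = []
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) >= 3:  # ignore header or malformed rows
                entries.append(
                    {"index": cells[0], "word": cells[1], "meaning": cells[2]}
                )
        result[title.lower()] = entries
    return result


if __name__ == "__main__":
    print(json.dumps(tables_by_heading(SAMPLE_HTML), ensure_ascii=False, indent=4))
```

To use it on the real page, you would pass result.text from requests.get(url) instead of SAMPLE_HTML and write the returned dict with json.dump, as in the question's code.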