How to get the value inside an h3 tag with BeautifulSoup in Python?

Question:

I'm trying to get the value inside an h3 tag, but I'm stuck on a problem I can't figure out.

This is the data I want to get. I want to extract the word Zahlen, which sits between the two span elements inside the h3 tag, but I couldn't figure out how to do this in Python with BeautifulSoup.

<h3>
  <span class="ez-toc-section" id="Zahlen"></span>
     Zahlen
  <span class="ez-toc-section-end"></span>
</h3>

I'm trying to build a dictionary dataset: a German dictionary whose words are grouped into categories. As mentioned above, Zahlen is a category identifier, and the words listed under it belong to that category. Other h3 tags will have their own words. This is the code I have written so far:

import json

import requests
from bs4 import BeautifulSoup

# url is the address of the page being scraped
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')
div = soup.find('div', class_='entry-content clearfix')

words = []  # collected entries; a named list avoids shadowing the built-in list

for word in div.find_all('table', class_='table table-bordered'):
    for word1 in word.find_all('tbody'):
        rows = word1.find_all('tr')
        for row in rows:
            each_word = row.find_all('td')
            case = {
                "index": each_word[0].string,
                "word": each_word[1].string,
                "meaning": each_word[2].string
            }
            words.append(case)

with open('DictionaryWords.json', 'w', encoding='utf-8') as f:
    json.dump(words, f, ensure_ascii=False, indent=4)

An example of the result:

[
    {
        "index": "1.",
        "word": "Hallo",
        "meaning": "Merhaba"
    },
    {
        "index": "2.",
        "word": "Herzlich willkommen",
        "meaning": "Hoş geldiniz"
    },
    {
        "index": "3.",
        "word": "Auf Wiedersehen",
        "meaning": "Hoşça kalın"
    },
    {
        "index": "4.",
        "word": "Guten Morgen",
        "meaning": "Günaydın"
    },
    {
        "index": "5.",
        "word": "Haben Sie einen guten Tag",
        "meaning": "İyi günler"
    }
]

This is the data that I want:

  [
     {
      "zahlen":[
        {
            "index": "1.",
            "word": "Hallo",
            "meaning": "Merhaba"
        },
        {
            "index": "2.",
            "word": "Herzlich willkommen",
            "meaning": "Hoş geldiniz"
        },
        {
            "index": "3.",
            "word": "Auf Wiedersehen",
            "meaning": "Hoşça kalın"
        }
      ]
     }
    ]

  
Asked By: NewPartizal

Answers:

You could find the titles with div.find_all('h3') and then zip them with the list of tables. But your loop to get the tables didn't work for me, so I rewrote your code using pandas' read_html, which extracts HTML tables into DataFrames (it relies on lxml or BeautifulSoup under the hood):

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://almancakonulari.com/a1-seviye-almanca-kelimeler/#gsc.tab=0'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# restrict the search to the div with class "entry-content clearfix"
div = soup.find('div', attrs={'class': 'entry-content clearfix'})

# get all h3 headings and normalize their whitespace
h3 = div.find_all('h3')
h3 = [' '.join(i.get_text().split()) for i in h3]

# get all tables with pandas; wrapping the HTML in StringIO avoids the
# deprecation warning newer pandas versions emit for literal HTML input
tables = pd.read_html(StringIO(page.text))
tables = tables[-len(h3):]  # keep only the last n tables, where n == len(h3)

# rename the columns of every table
for table in tables:
    table.columns = ['index', 'word', 'meaning']

# create the output dict: one list of row records per heading
output_dict = {title: t.to_dict(orient='records') for title, t in zip(h3, tables)}
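
The first suggestion (pairing each h3 title with its table) could also be sketched with BeautifulSoup alone. This is only a minimal sketch under assumptions: the inline HTML is made-up sample data rather than the real page, each h3 is assumed to be followed by exactly one table, and every data row is assumed to have three td cells. It uses find_next to locate the table after each heading instead of zipping two lists.

```python
from bs4 import BeautifulSoup

# Made-up sample markup mimicking the structure described in the question.
html = """
<div class="entry-content clearfix">
  <h3>
    <span class="ez-toc-section" id="Zahlen"></span>
       Zahlen
    <span class="ez-toc-section-end"></span>
  </h3>
  <table class="table table-bordered">
    <tbody>
      <tr><td>1.</td><td>eins</td><td>bir</td></tr>
      <tr><td>2.</td><td>zwei</td><td>iki</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="entry-content")

output = {}
for heading in div.find_all("h3"):
    # get_text(strip=True) skips the empty spans and trims the whitespace
    title = heading.get_text(strip=True)
    # find_next searches forward in document order for the first table
    table = heading.find_next("table")
    if table is None:
        continue
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # ignore header rows or malformed rows
            rows.append(dict(zip(("index", "word", "meaning"), cells)))
    output[title] = rows

print(output)
```

The resulting dict can be passed to json.dump in the same way as in the question's code. If a heading has no table of its own, find_next would pick up the table of a later heading, so the one-table-per-heading assumption should be checked against the real page.
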
Answered By: RJ Adriaansen