How to iteratively retrieve the right information from beautiful soup elements?

Question:

I am trying to retrieve information from EZB press releases using BeautifulSoup. Because the HTML structure of the press releases has changed over time, no single selector retrieves the date from every release. I therefore tried "try and except" as well as "if/else statements" to retrieve the date from all HTML files. Unfortunately, my code does not work as intended: I do not get the correct date from every press release.

Does anybody know how to iterate through multiple soup elements and choose the right element to select the date from the respective HTML file?

Here is my code:

from pandas.core.internals.managers import ensure_block_shape
import bs4, requests

pr_list = []

def parseContent(Urls):
  for x in Urls:
   res = requests.get(x)
   article = bs4.BeautifulSoup(res.text, 'html.parser')
   try:
    date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
    if date:
      for x in date:
        date = x.text.strip()   
    date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
    if date:
      for x in date:
          date = x.text.strip()     
    else:
      date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
      for x in date:
          date = x.text.strip()
   except:
    date = None
   try:
    title = article.select('#main-wrapper > main > div.title > h1')
    for x in title:
      title = x.text.strip()
   except:
    title = None
   try:
    body = article.select("#main-wrapper > main > div.section")
    for x in body:
      body = x.text.strip()
   except:
    body = None
   row = [date,title,body]
   pr_list.append(row)
Asked By: Nick


Answers:

Store your match expressions in a list and then iterate over them until one is successful:

import bs4
import requests


date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]

body_expressions = [
    "#main-wrapper > main > div.section",
]


def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches or if we find
    multiple matches.
    """

    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")
        date = try_several_expressions(article, date_expressions).text
        title = try_several_expressions(article, title_expressions).text
        body = try_several_expressions(article, body_expressions).text

        row = [date, title, body]
        pr_list.append(row)

    return pr_list

Assuming that you mean "ECB" rather than "EZB", I tested this against https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html and it seems to work as expected.
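
The fallback idea can also be checked offline, without requesting the live pages. The following is a minimal sketch, not part of the tested code above: the two HTML snippets are hypothetical miniatures that only imitate the layouts targeted by the question's selectors, and the helper first_match_text is an illustrative name. Unlike try_several_expressions, it uses select_one and returns None when nothing matches instead of raising.

```python
import bs4

# Hypothetical miniature pages imitating two of the layouts from the question.
NEW_LAYOUT = """
<div id="main-wrapper"><main>
  <div class="title"><h1>New release</h1></div>
  <div class="section"><p class="ecb-publicationDate">10 July 2023</p><p>Body.</p></div>
</main></div>
"""
OLD_LAYOUT = """
<div id="main-wrapper"><main>
  <div class="ecb-pressContentPubDate">12 September 2012</div>
  <div class="title"><h1>Old release</h1></div>
  <div class="section"><p>Body.</p></div>
</main></div>
"""

# Same selectors as in the question, tried in order.
DATE_SELECTORS = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]


def first_match_text(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element.get_text(strip=True)
    return None


new_date = first_match_text(bs4.BeautifulSoup(NEW_LAYOUT, "html.parser"), DATE_SELECTORS)
old_date = first_match_text(bs4.BeautifulSoup(OLD_LAYOUT, "html.parser"), DATE_SELECTORS)
missing = first_match_text(bs4.BeautifulSoup("<p>no date</p>", "html.parser"), DATE_SELECTORS)
print(new_date, old_date, missing, sep=" | ")
```

Returning None keeps a single unusual page from aborting the whole loop. Raising, as in the code above, is the better choice if a missing date should be treated as an error.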

Answered By: larsks

I improved your code as follows:

  • Removed the unnecessary try-except blocks.
  • Replaced the complex logic and long selectors with static selectors and regex-based dynamic selectors.

from bs4 import BeautifulSoup
from pprint import pprint
import re
import requests

pr_list = []

urls = [
    'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html',
    'https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html'
]

def parse_content(urls):
    for url in urls:
        print(url)
        res = requests.get(url)
        page = BeautifulSoup(res.text, 'html.parser')

        # initializing default values
        row = [None ,None ,None]
        
        #for dates
        # \d needs its backslash; a raw string keeps it intact
        date_pattern = re.compile(r'\d+ (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}')
        date_tag = page.find('main').find(attrs={'class': re.compile('Date')}, string=date_pattern)
        if date_tag:
            row[0] = date_tag.text.strip()
        
        
        # getting title
        row[1] = page.find('div', {'class': 'title'}).find('h1').text.strip() if page.find('div', {'class': 'title'}) and page.find('div', {'class': 'title'}).find('h1') else None
        
        # getting body
        row[2] = page.find('main').find('div', {'class': 'section'}).text.strip() if page.find('div', {'class': 'section'}) else None
        
        pr_list.append(row)


parse_content(urls)
pprint(pr_list)

Note that I used a regex to find the dates because, in the examples you provided, the dates follow this pattern and sit inside the main tag in elements whose class names contain Date.

The output is:

https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html
https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html
[['10 July 2023',
  'ECB surveys Europeans on new themes for euro banknotes',
  '10 July 2023Europeans invited to express preferences on shortlisted themes '
  'in public survey open until 31\xa0August 2023ECB’s Governing Council '
  'expected to choose future theme by 2024, and final designs in 2026The '
  'European Central Bank (ECB) is asking European citizens about their views '
  'on the proposed themes for the next series of euro banknotes. From 10 July '
  'until 31 August 2023 everybody in the euro area can respond to a survey on '
  'the ECB’s website. In addition, to ensure opinions from across the euro '
  'area are equally represented, the ECB has contracted an independent '
  'research company to ask a representative sample of people in the euro area '
  'the same questions as those in its own survey.ECB President Christine '
  'Lagarde invites everybody to participate in the survey. She said “There is '
  'a strong link between our single currency and our shared European identity, '
  'and our new series of banknotes should emphasise this. We want Europeans to '
  'identify with the design of euro banknotes, which is why they will play an '
  'active role in selecting the new theme.”Developing our future euro '
  'banknotes“We are working on a new series of high-tech banknotes with a view '
  'to preventing counterfeiting and reducing environmental impact,” said '
  'Executive Board member Fabio Panetta. “We are committed to cash and to '
  'ensuring that paying with public money is always an option.”It is the duty '
  'of the ECB and the euro area national central banks to ensure euro '
  'banknotes remain an innovative, secure and efficient means of payment. '
  'Developing new series of banknotes is a standard practice for all central '
  'banks. In a world where reproduction technologies are rapidly evolving and '
  'where counterfeiters can easily access information and materials, it is '
  'necessary to issue new banknotes on a regular basis. Beyond security '
  'considerations, the ECB is committed to reducing the environmental impact '
  'of euro banknotes throughout their life cycle, while also making them more '
  'relatable and inclusive for Europeans of all ages and backgrounds, '
  'including vulnerable groups such as people with visual '
  'impairment.Shortlisted themes for future banknotesThe seven themes '
  'shortlisted by the ECB’s Governing Council are listed below.[1]Birds: free, '
  'resilient, inspiringBirds know nothing of national borders and symbolise '
  'freedom of movement. Their nests remind us of our own desire to build '
  'places and societies that nurture and protect the future. They remind us '
  'that we share our continent with all the lifeforms that sustain our common '
  'existence.European cultureEurope’s rich cultural heritage and dynamic '
  'cultural and creative sectors strengthen the European identity, forging a '
  'shared sense of belonging. Culture promotes common values, inclusion and '
  'dialogue in Europe and across the globe. It brings people together.European '
  'values mirrored in natureEurope is a living place, but also an idea. The '
  'European Union is an organisation, but also a set of values. The theme '
  'highlights the role of European values (human dignity, freedom, democracy, '
  'equality, the rule of law and human rights) as the building blocks of '
  'Europe and links these values to our respect for nature and the '
  'preservation of the environment.The future is yoursThe ideas and '
  'innovations that will shape the future of Europe lie deep within every '
  'European. The images created for this theme represent the bearers of the '
  'collective imagination through which people will create this shared future. '
  'This theme signifies the boundless potential of Europeans.Hands: together '
  'we build EuropeHands are familiar to all of us but no two pairs are the '
  'same. Hands built Europe, its physical infrastructure, its artistic '
  'heritage and its achievements. Hands build, weave, heal, teach, connect and '
  'guide us. Hands tell stories of labour, age and relationships, of heritage, '
  'history, and culture. This theme celebrates the hands that have built '
  'Europe and continue to do so every day.\xa0Our Europe, ourselvesWe grow up '
  'as individuals but also as part of a community, through our relationships '
  'with one another. We have our own stories and identities, but we also share '
  'a common identity as Europeans. This theme evokes the freedom, values and '
  "openness of people in Europe.Rivers: the waters of life in EuropeEurope's "
  'rivers cross borders. They connect us to each other and to nature. They '
  'represent the ebb and flow of a dynamic, ever-changing continent. They '
  'nurture us and remind us of the deep sources of our common life, and we '
  'must nurture them in turn.The shortlist of themes takes into account the '
  'suggestions made by a multidisciplinary advisory group, with members from '
  'all euro area countries.Timeline for the new designsThe outcome of the '
  'surveys will be used by the ECB to select the theme for the next generation '
  'of banknotes by 2024. After that a design competition will take place. '
  'European citizens will again have the chance to express their preferences '
  'on the design options resulting from that competition. The ECB is expected '
  'to take the decision on the future design, and on when to produce and issue '
  'the new banknotes, in 2026.For media queries, please contact Belén Pérez '
  'Esteve, tel.: +49 173 533 4269.'],
 ['12 September 2012',
  'ECB extends the swap facility agreement \u2028with the Bank of England',
  'The Governing Council of the European Central Bank (ECB) has decided, in '
  'agreement with the Bank of England, to extend the liquidity swap '
  'arrangement with the Bank of England up to \u2028'
  '30 September 2013. The swap facility agreement established on 17 December '
  '2010 had been authorised until the end of September 2011 and then extended '
  'until 28 September 2012.\n'
  'The related announcement by the Bank of England is available at their '
  'website http://www.bankofengland.co.uk.']]
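
The regex-based date lookup described above can be tried on inline HTML as well. This is a small sketch under stated assumptions, not the code that produced the output above: the sample snippets and the helper name find_date are hypothetical, and it searches the text of each Date-classed element rather than passing string= to find, so a date surrounded by other text is still found.

```python
import re
from bs4 import BeautifulSoup

# Day, English month name, four-digit year (raw string keeps the backslashes).
DATE_RE = re.compile(
    r"\d{1,2} (?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December) \d{4}"
)


def find_date(html):
    """Return the first date-like string in a 'Date'-classed element, or None."""
    page = BeautifulSoup(html, "html.parser")
    root = page.find("main")
    if root is None:
        root = page  # fall back to the whole document if <main> is absent
    for tag in root.find_all(class_=re.compile("Date")):
        match = DATE_RE.search(tag.get_text(" ", strip=True))
        if match:
            return match.group(0)
    return None


# Hypothetical snippets, loosely modelled on the class names in the question.
samples = [
    '<main><p class="ecb-publicationDate">10 July 2023</p></main>',
    '<main><div class="ecb-pressContentPubDate">PRESS RELEASE 12 September 2012</div></main>',
    '<main><p class="ecb-publicationDate">undated</p></main>',
]
for html in samples:
    print(find_date(html))
```

Searching the element text instead of requiring the whole string to match trades a little precision for robustness against extra words around the date.
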
Answered By: Zero