Python HTML parsing getting only first element without children

Question:

I’m trying to parse a website, but the result is not the whole HTML, only some of the elements. The page I’m practicing on is https://dre.pt/dre/detalhe/anuncio-concurso-urgente/491-2022-201313437 (inspect the page in a browser to see its HTML).

At first I tried parsing with BeautifulSoup4, but since I was not getting all of the HTML, I tried fetching it with the requests package. This is the code I used:

import requests

page = requests.get('https://dre.pt/dre/detalhe/anuncio-concurso-urgente/491-2022-201313437')
print(page.text)

And this is the result of the print:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <meta name="format-detection" content="telephone=no" />
        <script type='text/javascript'>window.OutSystemsApp = { basePath: '/dre/' };</script>
        <meta http-equiv="Content-Security-Policy" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=6ynyp6xeJpZnfxI6yAFDPT6aME%2BSwVNSjUa7DbEWDtGj%2BeYOS2vlfnuZj5cBszGf0z2tYxgp5XrpqwJyMUUnTw%3D%3D; " />
<meta http-equiv="X-Content-Security-Policy" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=6ynyp6xeJpZnfxI6yAFDPT6aME%2BSwVNSjUa7DbEWDtGj%2BeYOS2vlfnuZj5cBszGf0z2tYxgp5XrpqwJyMUUnTw%3D%3D; " />
<meta http-equiv="X-WebKit-CSP" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=6ynyp6xeJpZnfxI6yAFDPT6aME%2BSwVNSjUa7DbEWDtGj%2BeYOS2vlfnuZj5cBszGf0z2tYxgp5XrpqwJyMUUnTw%3D%3D; " />

        
        <meta name="viewport" content="viewport-fit=cover, width=device-width, initial-scale=1" />
<script type="text/javascript">
(function () {
    function appendMetaTagAttributes(metaTag, attribute, values) {
        var elem = document.querySelector("meta[name=" + metaTag + "]");

        if (elem) {
            var attrContent = elem.getAttribute(attribute);
            elem.setAttribute(attribute, (attrContent ? attrContent + "," : "") + values.join(","));
        }
    }

    if (navigator && /OutSystemsApp/i.test(navigator.userAgent)) {
        // If this app is running on the native shell, we want to disable the zoom
        appendMetaTagAttributes("viewport", "content", ["user-scalable=no", "minimum-scale=1.0"]);
    }
})();</script>

        <script type="text/javascript" src="/dre/scripts/OutSystemsManifestLoader.js?Uno7DkMuu4+3RKfSExTIUg"></script>
<script type="text/javascript" src="/dre/scripts/OutSystems.js?eq9LGmzdgJbMq6dbNoVMvQ"></script>
<script type="text/javascript" src="/dre/scripts/OutSystemsReactView.js?DtzHEvOePADFJAVO+XYgVg"></script>
<script type="text/javascript" src="/dre/scripts/cordova.js?7KqI9_oL9hClomz1RdzTqg"></script>
<script type="text/javascript" src="/dre/scripts/NullDebugger.js?ivgNSF0_ZARULD3LtoI2HA"></script>
<script type="text/javascript" src="/dre/scripts/DRE.appDefinition.js?otw_Nv9Nr+Q7EbWK92qVcw"></script>
<script type="text/javascript" src="/dre/scripts/OutSystemsReactWidgets.js?E4SSw3FwbHWsyMMUPr64mg"></script>
<link type="text/css" rel="stylesheet" href="/dre/css/_Basic.css?EqGzAe81QbZLXJyfY3oLwA"></link>

        <script type="text/javascript">OSManifestLoader.indexVersionToken = "o1jW2zcC1fbJz1ZC8XfM8g";
</script>
    </head>
    <body>
        <div id="reactContainer"></div>
        <noscript><span>JavaScript is required</span></noscript>

        <script type="text/javascript" src="/dre/scripts/DRE.index.js?lmc__VYpxdf3u6qXtN7E9w"></script>

    </body>
</html>

As you can see, I’m getting the whole head of the HTML, but in the body I only receive the reactContainer div and none of its children, which is where the information is. How can I get the HTML including the children?

Asked By: NukeSkull

||

Answers:

If you’re willing to give selenium another try, I use this function in several of my scraping projects:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def linkToSoup_selenium(l, ecx=None, clickFirst=None):
    driver = None
    try:
        # I copy chromedriver.exe to the same folder as this py file
        # (Selenium 4 takes the driver path through a Service object)
        driver = webdriver.Chrome(service=Service('chromedriver.exe'))

        driver.get(l) # go to link

        # if something needs to be confirmed by click
        if clickFirst:
            WebDriverWait(driver, 25).until(
                EC.element_to_be_clickable((By.XPATH, clickFirst))
            )
            driver.find_element(By.XPATH, clickFirst).click()

        # if some section needs to be loaded first
        if ecx:
            WebDriverWait(driver, 25).until(
                EC.visibility_of_all_elements_located((By.XPATH, ecx)))

        # parse the page as rendered by the browser, scripts included
        return BeautifulSoup(driver.page_source, 'html.parser')
    except Exception as e:
        print(str(e))
        return None
    finally:
        # shut the browser down even if something above failed
        if driver is not None:
            driver.quit()

It’s been useful for several sites protected by blockers like Cloudflare, for sites like the one in your question where parts are rendered after the initial load, and for sites where you might have to click through some kind of pop-up about age, cookies, etc. first.
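
If you want to confirm that a page falls into the "rendered after the initial load" category before reaching for a browser, a rough check is to see whether the container in the static HTML is empty. The sketch below uses only the standard library; the names ContainerProbe and container_is_empty are made up for this illustration, and it assumes the container can be found by its id (reactContainer on the page in the question).

```python
from html.parser import HTMLParser


class ContainerProbe(HTMLParser):
    """Counts tags and non-blank text found inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.child_count = 0  # tags and text chunks seen inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.child_count += 1
            if tag not in self.VOID:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.child_count += 1


def container_is_empty(html, target_id):
    """Return True if the element with id=target_id has no tags or text inside."""
    probe = ContainerProbe(target_id)
    probe.feed(html)
    probe.close()
    return probe.child_count == 0


# Shortened version of the body that requests returned in the question
static_html = ('<body><div id="reactContainer"></div>'
               '<noscript><span>JavaScript is required</span></noscript></body>')
print(container_is_empty(static_html, "reactContainer"))
```

If this reports an empty container for page.text, the content is most likely filled in by JavaScript, and fetching the same URL again with requests will not change that.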


In your case, you can use the function to get the bs4 tree after your target section has been loaded. For example, if you needed something from the "TEXTO" section, you could call the function with the XPath for any paragraph in that section:

soup = linkToSoup_selenium(
    l = 'https://dre.pt/dre/detalhe/anuncio-concurso-urgente/491-2022-201313437',
    ecx = '//div[@id="transitionContainer"]//div[@data-container]//div[@data-container]/p'
    # clickFirst argument is not necessary for this site
)

[If you need it, this is a good cheatsheet for xpath.]

Once you have soup, you can extract the data you need from it – for example,

if soup is not None: 
    numbered = soup.find_all(lambda t: t.name == 'p' and
        t.text.count(' - ') > 0 and t.text.split(' - ')[0].isdigit())
        
    for n in numbered:
        print(n.get_text(strip=True))

will output

1 - IDENTIFICAÇÃO E CONTACTOS DA ENTIDADE ADJUDICANTE
2 - OBJETO DO CONTRATO
3 - INDICAÇÕES ADICIONAIS
4 - LOCAL DA EXECUÇÃO DO CONTRATO
5 - DIVISÃO EM LOTES, SE FOR O CASO
6 - PRAZO DE EXECUÇÃO DO CONTRATO
7 - DOCUMENTOS DE HABILITAÇÃO
8 - CONDIÇÕES DE PARTICIPAÇÃO
9 - ACESSO ÀS PEÇAS DO CONCURSO E APRESENTAÇÃO DAS PROPOSTAS
10 - PRAZO PARA APRESENTAÇÃO DAS PROPOSTAS
11 - CRITÉRIO DE ADJUDICAÇÃO
12 - IDENTIFICAÇÃO E CONTACTOS DO ÓRGÃO DE RECURSO ADMINISTRATIVO
13 - DATA E HORA DE ENVIO DO ANÚNCIO PARA PUBLICAÇÃO NO DIÁRIO DA REPÚBLICA
14 - IDENTIFICAÇÃO DO(S) AUTOR(ES) DO ANÚNCIO
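
If the lambda filter above gets hard to read, roughly the same test can be pulled out into a small helper that works on plain strings, which also makes it easy to try without opening a browser. This is only a sketch: is_numbered_heading is a name invented here, and unlike the lambda it strips surrounding whitespace before matching.

```python
import re

# One or more digits, then " - ", then the start of the heading text
NUMBERED_HEADING = re.compile(r"^\d+ - \S")


def is_numbered_heading(text):
    """Return True for strings shaped like '2 - OBJETO DO CONTRATO'."""
    return bool(NUMBERED_HEADING.match(text.strip()))


samples = [
    "1 - IDENTIFICAÇÃO E CONTACTOS DA ENTIDADE ADJUDICANTE",
    "TEXTO",
    "14 - IDENTIFICAÇÃO DO(S) AUTOR(ES) DO ANÚNCIO",
]
for s in samples:
    print(s, is_numbered_heading(s))
```

With a soup in hand, the helper could be used as soup.find_all(lambda t: t.name == 'p' and is_numbered_heading(t.text)).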

Answered By: Driftr95