Scraping when there is an on-click button

Question:

Hi, I’m trying to scrape PDF files from this website.
I tried using Beautiful Soup and also lxml, but I get empty lists. Please let me know where I’m going wrong. There is also an on-click button.

My code:

import requests
from bs4 import BeautifulSoup
from lxml import html

r = requests.get('http://www.italgiure.giustizia.it/sncass/')
soup = BeautifulSoup(r.text, 'html.parser')

pdf_list = soup.find_all('a')
print(pdf_list)

search_html = html.fromstring(r.text)
page_link = search_html.xpath('//*[@id="contentData"]/div[2]/div[1]/div/h3/a/span[1]/span')
print(page_link)

Results:

[<a href="accessibilita.html" style="text-decoration:none;font-size:80%;color:white" tabindex="0">Accessibilità</a>, <a accesskey="r" name="results" onclick="$(this).next().focus();" tabindex="-2" title="contenuto"></a>, <a accesskey="1" name="card" onclick="$(this).next().focus();" tabindex="-2" title="documento"></a>, <a class="text2pdf" href="javascript:void(0)" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="filename" data-role="content" title="pdf"></span>  <span class="chkcontent"><span class="label">Sez.</span> <span class="risultato" data-arg="szdec" data-role="content"></span> <span class="risultato" data-arg="kind" data-role="content"></span><span class="chkcontent"> - <span class="risultato" data-arg="ssz" data-role="content"></span></span><span class="label">,</span> </span> <span data-arg="tipoprov" data-role="content"></span> <span class="chkcontent"><span class="chkcontent"><span class="label">n.</span><span data-arg="numcard" data-role="content"></span></span><span data-arg="numdec" data-role="content" style="display:none"></span><span data-arg="numdep" data-role="content" style="display:none"></span> <span class="chkcontent"><span class="label"> del </span><span data-arg="datdep" data-role="content"></span><span data-arg="ecli" data-role="content" style="font-weight:normal"></span><span data-arg="anno" data-role="content" style="display:none"></span><span class="label">,</span></span> </span> <span class="chkcontent"><span class="label">udienza del</span> <span data-arg="datdec" data-role="content"></span><span class="label">,</span></span> <span class="chkcontent"><span class="label">Presidente </span><span data-arg="presidente" data-role="content"></span> </span> <span class="chkcontent"><span class="label">Relatore </span><span data-arg="relatore" data-role="content"></span> </span> </a>, <a class="text2ocr" href="javascript:void(0)" 
onclick="toTargetText($('.toDocument.txt',$(this)).attr('data-arg'))" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="testoocr" data-role="content" title="testo ocr"></span>  <span data-arg="ocr" data-role="datasubset"> <span data-arg="ocr" data-role="multivaluedcontent">snippet</span> </span> </a>, <a href="http://www.italgiure.giustizia.it" style="color:white;" tabindex="0">ItalgiureWeb</a>] []

In the above results I cannot retrieve the web links, which are inside the <span class="toDocument pdf" data-arg="/xw...
I also tried targeting the span class directly:

pdf_list = soup.find('span', {'class': 'toDocument pdf'})

The HTML, as rendered in the browser, is:

<a href="javascript:void(0)" tabindex="0" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" class="text2pdf"> <span data-role="content" data-arg="filename" title="pdf"><span class="toDocument pdf" data-arg="/xway/application/nif/clean/hc.dll%3Fverbo%3Dattach%26db%3Dsnciv%26id%3D./20221107/snciv@s50@a2022@[email protected]"><img class="rowIcon" alt="formato pdf" src="pix/pdf.png"></span></span>&nbsp; <span class="chkcontent"><span class="label">Sez.</span>&nbsp;<span class="risultato" data-role="content" data-arg="szdec">QUINTA</span> <span class="risultato" data-role="content" data-arg="kind">CIVILE</span><span class="label">,</span> </span> <span data-role="content" data-arg="tipoprov">Ordinanza</span> <span class="chkcontent"><span class="chkcontent"><span class="label">n.</span><span data-role="content" data-arg="numcard">32765</span></span><span style="display:none" data-role="content" data-arg="numdec">32765</span><span style="display:none" data-role="content" data-arg="numdep"></span> <span class="chkcontent"><span class="label"> del </span><span data-role="content" data-arg="datdep">07/11/2022</span><span style="font-weight:normal" data-role="content" data-arg="ecli"> (ECLI:IT:CASS:2022:32765CIV)</span><span style="display:none" data-role="content" data-arg="anno">2022</span><span class="label">,</span></span> </span> <span class="chkcontent"><span class="label">udienza del</span>&nbsp;<span data-role="content" data-arg="datdec"><span style="font-weight:normal">19/10/2022</span></span><span class="label">,</span></span> <span class="chkcontent"><span class="label">Presidente </span><span data-role="content" data-arg="presidente">PAOLITTO LIBERATO</span>&nbsp;</span> <span class="chkcontent"><span class="label">Relatore </span><span data-role="content" data-arg="relatore">DELL'ORFANO ANTONELLA</span>&nbsp;</span> </a>

Please let me know how to approach this. Thanks in advance.

Asked By: Praveen Bushipaka


Answers:

The file list comes from a POST request to a Solr search endpoint, and you need to mimic that request to get the files.

For example:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

query_url = "http://www.italgiure.giustizia.it/sncass/isapi/hc.dll/sn.solr/sn-collection/select?app.query="

payload = {
    "start": "0",
    "rows": "10",
    "q": '((kind:"snciv" OR kind:"snpen")) AND szdec:"F" AND anno:"2022"',
    "wt": "json",
    "indent": "off",
    "sort": "pd desc,numdec desc",
    "fl": "id,filename,szdec,kind,ssz,tipoprov,numcard,numdec,numdep,datdep,ecli,anno,datdec,presidente,relatore,testoocr,ocr",
    "hl": "true",
    "hl.snippets": "4",
    "hl.fragsize": "100",
    "hl.fl": "ocr",
    "hl.q": 'nomatch AND szdec:"F" AND anno:"2022"',
    "hl.maxAnalyzedChars": "1000000",
    "hl.simple.pre": '<em class="hit">',
    "hl.simple.post": "</em>",
}

docs = (
    requests
    .post(query_url, headers=headers, data=payload).json()["response"]["docs"]
)

base_url = "http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id="
for doc in docs:
    print(f'{base_url}{doc["filename"][0].replace(".pdf", ".clean.pdf")}')

This will get you the first 10 PDFs for FERIALE -> 2022:

http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221103/snpen@sF0@a2022@n41566@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221021/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221012/snpen@sF0@a2022@n38545@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
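To actually save these files to disk, a minimal sketch (the `local_name` and `download_pdf` helpers are my additions, not part of the original answer; the file name is derived from the URL's "id" query parameter):

```python
import os
import urllib.parse

import requests

def local_name(url: str) -> str:
    # Derive a file name from the "id" query parameter of the attach URL.
    doc_id = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)["id"][0]
    return doc_id.split("/")[-1]

def download_pdf(url: str, out_dir: str = "pdfs") -> str:
    # Fetch one PDF and write it under out_dir; returns the local path.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, local_name(url))
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

Called with each URL printed by the loop above, this would save the PDFs under a local `pdfs/` folder.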

To "select" different menu items, edit the filter values in the payload (you will likely want to update the matching szdec/anno values in the "q" field as well). For example, this gets you the first 10 files for UNITE for 2017:

    "hl.q": 'nomatch AND szdec:"U" AND anno:"2017"'

If you wish to paginate the results, change the value of "start"; for example, set it to "10" to get the next 10 docs:

    "start": "10"
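The pagination step above can be automated with a small sketch (the `paged_payloads` helper is my naming, not from the original answer):

```python
def paged_payloads(base_payload: dict, pages: int, rows: int = 10):
    # Yield one payload per page, bumping "start" by `rows` each time.
    for page in range(pages):
        payload = dict(base_payload)  # shallow copy; leave the original intact
        payload["start"] = str(page * rows)
        payload["rows"] = str(rows)
        yield payload

# Each yielded payload can be POSTed exactly like the single-page request above.
```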
Answered By: baduker