Downloading PDFs from CAG

Question:

I am trying to download multiple PDFs from the CAG website (https://cag.gov.in/en/state-accounts-report?defuat_state_id=64). I am using the following code:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=64'
folder_location = 'pdfs'  # destination folder for the downloads

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# list every PDF link on the page
for link in soup.select("a[href$='.pdf']"):
    print(link)

# download each PDF into folder_location
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

This gives me all the PDFs from the whole page, but I want to download only the PDFs under the 'Monthly Key Indicators' tab. Please suggest the necessary changes to the code.

Asked By: ASaharan


Answers:

You could try narrowing down the tab from which the links are selected. The tab id can be found using

tabId = soup.find(
    lambda t: t.name == 'a' and t.get('href') and 
    t.get('href').startswith('#tab') and # just in case
    'Monthly Key Indicators' == t.get_text(strip=True)
).get('href')

(Or, if it is always the same id, you can simply set tabId = "#tab-360".) Then, change your selection to

soup.select(f"{tabId} a[href$='.pdf']")

But aren’t you downloading the same file 3x with each report? You could alter your for-loop to only download from the links with "Download" as text:

pdfLinks = soup.select(f"{tabId} a[href$='.pdf']")
pdfLinks = [pl for pl in pdfLinks if pl.get_text(strip=True) == 'Download']
for link in pdfLinks:
    # download exactly as in your original loop
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

Answered By: PerpetuallyConfused
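
For reference, a minimal end-to-end sketch putting both suggestions together. It assumes the 'Monthly Key Indicators' tab anchor and the '#tab-...' href structure described above are present on the page; the folder name 'pdfs' is an arbitrary choice and not part of the original question.

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=64'
folder_location = 'pdfs'  # assumed destination folder
os.makedirs(folder_location, exist_ok=True)

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# locate the 'Monthly Key Indicators' tab anchor and read its '#tab-...' href
tab = soup.find(
    lambda t: t.name == 'a' and t.get('href', '').startswith('#tab')
    and t.get_text(strip=True) == 'Monthly Key Indicators'
)
if tab is None:
    raise SystemExit("'Monthly Key Indicators' tab not found")
tabId = tab.get('href')

# keep only the 'Download' links inside that tab to avoid duplicate downloads
pdfLinks = [
    a for a in soup.select(f"{tabId} a[href$='.pdf']")
    if a.get_text(strip=True) == 'Download'
]

for link in pdfLinks:
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)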