How to download PDFs from a list of URLs using the wget module?
Question:
I have a Python script which scrapes URLs from a website with Selenium and stores them in a list. Now I would like to download them with the wget module.
This is the relevant part of the code, where the script completes the partial URLs obtained from the website:
new_links = []
for link in list_of_links:  # trim links
    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)

for new_link in new_links:
    wget.download(new_link)
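To illustrate what the trimming does, here is the transformation on one sample string. The exact shape of the scraped onclick text is an assumption reconstructed from the URLs shown below; the real value may differ slightly:

```python
# assumed shape of one scraped link; the real onclick text may differ slightly
link = ("http://digesto.asamblea.gob.ni/consultas/coleccion/"
        "window.open('/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D');return false")

# split off everything up to the opening quote of window.open(...)
parts = link.split("/consultas/coleccion/window.open('")
# keep only the path inside the quotes
path = parts[1].split("');return")[0]

print(parts[0] + path)
# http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
```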
The script doesn’t do anything at this point. It never downloads any PDFs and displays no error message.
What did I do wrong in the second for loop?
As for the question of whether new_links is empty: it is not.
print(*new_links, sep='\n')
gives me links like these (here just four of many):
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D
A partial URL looks like:
/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
The "base URL" is then prepended to it:
http://digesto.asamblea.gob.ni
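As a side note, this kind of joining can also be done with the standard library's urljoin, which correctly handles paths that start with / (the base URL below is taken from the links above):

```python
from urllib.parse import urljoin

base = 'http://digesto.asamblea.gob.ni/consultas/coleccion/'
partial = '/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D'

# a leading '/' replaces the base's path, keeping only scheme and host
print(urljoin(base, partial))
# http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
```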
This is the relevant part of the code, which just comes before the code above, where it collects the partial URLs:
list_of_links = []  # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url  # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements(By.CSS_SELECTOR, 'tbody tr')  # get all table rows
for row in rows:
    row.find_element(By.CSS_SELECTOR, 'button').click()  # open the row's menu
    link = row.find_element(By.CSS_SELECTOR, 'li a[onclick*=pdf]').get_attribute('onclick')  # get partial link
    if link.startswith('/'):
        list_of_links.append(tld + link)  # add base to partial link
    else:
        list_of_links.append(current_url + link)
    row.find_element(By.CSS_SELECTOR, 'button').click()  # close the menu again
Answers:
Your loop is working. Try upgrading the wget module to version 3.2 and check:
new_links = ['http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D',
             'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D',
             'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D',
             'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D']

for new_link in new_links:
    wget.download(new_link)
Output: four files were downloaded, named pdf.php, pdf (1).php, etc.
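If those pdf.php, pdf (1).php names are not wanted, one sketch of an alternative is to derive a filename yourself and download with the standard library instead of wget. Using the rdd query parameter as the name source is an assumption based on the URLs above:

```python
from urllib.parse import urlparse, parse_qs
from urllib.request import urlretrieve

def pdf_filename(url):
    # derive a name from the 'rdd' query parameter (assumption: every
    # link carries one); fall back to 'document.pdf' otherwise
    qs = parse_qs(urlparse(url).query)
    name = qs.get('rdd', ['document'])[0].rstrip('=')
    return name + '.pdf'

# usage, given the scraped list:
# for new_link in new_links:
#     urlretrieve(new_link, pdf_filename(new_link))
```

parse_qs already URL-decodes the %3D suffix to '=', which rstrip then removes so the name is filesystem-friendly.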