How to download PDF files with Playwright? (Python)
Question:
I’m trying to automate the download of a PDF file using Playwright. I have the code working with Selenium, but some features in Playwright caught my attention. The real problem is that the documentation isn’t helpful. When I click on download I get this:
I also can’t change the download directory, and the "file" is deleted when the browser/context is closed. Can I achieve a reliable download automation with Playwright?
Code:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context(accept_downloads=True)
    # Open new page
    page = context.new_page()
    # Go to http://xcal1.vodafone.co.uk/
    page.goto("http://xcal1.vodafone.co.uk/")
    # Click text=Extra Small File 5 MB A high quality 5 minute MP3 music file 30secs @ 2 Mbps 10s >> img
    with page.expect_download() as download_info:
        page.click("text=Extra Small File 5 MB A high quality 5 minute MP3 music file 30secs @ 2 Mbps 10s >> img")
    download = download_info.value
    path = download.path()
    download.save_as(path)
    print(path)
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
Answers:
The path returned by download.path() in Playwright points to a temporary file named with a random GUID (globally unique identifier). It’s designed to validate that the download works, not to keep the file.
Playwright is a testing tool. Imagine running tests across every major browser on every code change: the downloads would quickly take up a lot of space, and it would annoy people if they had to clear them out manually.
The good news is that you are very close. If you want to keep the file, you just need to give it a name in save_as.
Instead of this:
download.save_as(path)
use this:
download.save_as(download.suggested_filename)
That saves the file under its suggested name in the current working directory (usually the folder you run the script from).
You can save to any other location by passing a full path to download.save_as(path).
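A relative name passed to save_as is typically resolved against the current working directory, which is not always the folder that contains the script. Here is a minimal sketch of building an explicit path next to the script instead. The helper name target_next_to is made up for illustration, and in a real script you would pass __file__ as the first argument.

```python
from pathlib import Path


def target_next_to(script_file, filename: str) -> Path:
    """Return an absolute path for `filename` in the folder containing `script_file`."""
    script_dir = Path(script_file).resolve().parent
    return script_dir / filename


# Hypothetical usage inside a Playwright script:
# download.save_as(target_next_to(__file__, download.suggested_filename))
```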
This worked for me.
from pathlib import Path
...
download.save_as(Path.home().joinpath('Downloads', download.suggested_filename))
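The same idea can be wrapped in a small helper. This is only a sketch: save_download is a hypothetical name, and the download argument is assumed to be a Playwright sync Download (anything with suggested_filename and save_as works). Creating the folder and stripping directory parts from the suggested name are defensive choices of mine, not something Playwright requires.

```python
from pathlib import Path


def save_download(download, directory) -> Path:
    """Save `download` into `directory` under its suggested name; return the final path."""
    target_dir = Path(directory).expanduser()
    # Make sure the destination folder exists before saving.
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the last path component of the suggested name so the file
    # always lands inside target_dir; fall back to a generic name if empty.
    safe_name = Path(download.suggested_filename).name or "download.bin"
    target = target_dir / safe_name
    download.save_as(target)
    return target


# Hypothetical usage:
# saved = save_download(download, Path.home() / "Downloads")
# print(saved)
```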
This works well for me:
url = config.url  # your file URL
# page_request is assumed to be a Playwright APIRequestContext, e.g. page.request
response = await page_request.get(url, params={'id': file_id})  # your request
file = await response.body()  # downloaded file contents (bytes), not yet saved
file_name = 'filename.pdf'  # filename to save as
with open(file_name, 'wb') as f:
    f.write(file)
print(f'File {file_name} is saved')
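When fetching the file through a request like this, the server may return an HTML error or login page instead of the PDF, and the code above would save it anyway. Here is a minimal sketch of checking the bytes before writing them. write_pdf is a hypothetical helper, and the check relies on the fact that PDF files normally begin with the marker %PDF-.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # marker that normally opens a PDF file


def write_pdf(data: bytes, dest) -> Path:
    """Write `data` to `dest` only if it looks like a PDF; return the path."""
    if not data.startswith(PDF_MAGIC):
        raise ValueError("response body does not look like a PDF")
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return dest


# Hypothetical usage with the response from the snippet above:
# write_pdf(await response.body(), "filename.pdf")
```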
When I tried similar code, I got this error:
playwright._impl._api_types.Error: net::ERR_ABORTED at https://www.africau.edu/images/default/sample.pdf
=========================== logs ===========================
navigating to "https://www.africau.edu/images/default/sample.pdf", waiting until "load"
============================================================
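In retrospect, this is likely because I had configured my playwright.chromium.launch_persistent_context(user_dir) profile with "always_open_pdf_externally: true". With that setting, Chromium downloads the PDF rather than displaying it, so the navigation itself is aborted. The setting comes from this example: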
https://github.com/microsoft/playwright/issues/3509
Instead, what I needed to do was wrap the navigation in try/except, like this:
import os

# downloads_path and file_name are defined elsewhere
async with page.expect_download() as download_info:
    try:
        await page.goto("https://www.africau.edu/images/default/sample.pdf", timeout=5000)
    except Exception:
        # goto() raises (net::ERR_ABORTED) because the navigation turns into a download
        print("Saving file to ", downloads_path, file_name)
        download = await download_info.value
        print(await download.path())
        await download.save_as(os.path.join(downloads_path, file_name))
    await page.wait_for_timeout(200)
Maybe this helps someone.
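The workaround above can be sketched as a reusable coroutine. This is an illustration under assumptions, not an official recipe. download_via_navigation is a hypothetical name, and page is expected to behave like a Playwright async Page. Ignoring the goto error follows the answer above. If no download actually starts, expect_download will raise its own timeout error.

```python
import asyncio
from pathlib import Path


async def download_via_navigation(page, url: str, dest: Path, timeout_ms: int = 5000) -> Path:
    """Open a URL that triggers a download and save the file to `dest`."""
    async with page.expect_download() as download_info:
        try:
            await page.goto(url, timeout=timeout_ms)
        except Exception:
            # Assumption: goto fails (e.g. net::ERR_ABORTED) because the
            # navigation turned into a download, so the error is ignored here.
            pass
    # After the block exits, the download event has been received.
    download = await download_info.value
    dest.parent.mkdir(parents=True, exist_ok=True)
    await download.save_as(dest)
    return dest


# Hypothetical usage inside an async Playwright session:
# await download_via_navigation(
#     page, "https://www.africau.edu/images/default/sample.pdf",
#     Path("downloads") / "sample.pdf")
```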
It seems there isn’t a clean way to do this yet:
https://github.com/microsoft/playwright/issues/7822