python selenium, find out when a download has completed?
Question:
I’ve used Selenium to initiate a download. After the download is complete, certain actions need to be taken. Is there any simple method to find out when a download has completed? (I am using the Firefox driver.)
Answers:
Selenium has no built-in way to wait for a download to complete.
The general idea is to wait until the file appears in your “Downloads” directory.
This can be achieved either by checking for the file's existence in a loop:
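A minimal sketch of such a loop, assuming you know the final file name in advance. The function name `wait_for_file`, the timeout, and the poll interval are illustrative choices, not part of Selenium:

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll_interval)
    # One last check so a file that appears during the final sleep is not missed.
    return os.path.isfile(path)
```

You would call it right after triggering the download, passing the full path you expect the file to have. Note that existence alone does not prove the browser has finished writing the file.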
Or by using something like watchdog to monitor a directory:
With Chrome, files which have not finished downloading have the extension .crdownload. If you set your download directory properly, you can wait until the file that you want no longer has this extension. In principle, this is not much different from waiting for the file to exist (as suggested by alecxe), but at least you can monitor progress this way.
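import os
import time

done = False
while not done:
    # Count the files that Chrome is still writing.
    count = 0
    for name in os.listdir("directorypath"):
        if name.endswith(".crdownload"):
            count += 1
    done = count == 0
    if not done:
        time.sleep(1)  # pause between checks to avoid a busy loop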
This works if you are trying to check whether a set of files (more than one) has finished downloading.
I came across this problem recently. I was downloading multiple files at once and had to build in a way to time out if the downloads failed.
The code checks the filenames in some download directory every second and exits once they are complete or if it takes longer than 20 seconds to finish. The returned download time was used to check if the downloads were successful or if it timed out.
import time
import os

def download_wait(path_to_downloads):
    seconds = 0
    dl_wait = True
    while dl_wait and seconds < 20:
        time.sleep(1)
        dl_wait = False
        for fname in os.listdir(path_to_downloads):
            if fname.endswith('.crdownload'):
                dl_wait = True
        seconds += 1
    return seconds
I believe that this only works with Chrome, since its in-progress files end with the .crdownload extension. There may be a similar way to check in other browsers.
Edit: I recently changed the way that I use this function for times when .crdownload does not appear as the extension. Essentially, this version also waits for the correct number of files.
def download_wait(directory, timeout, nfiles=None):
    """
    Wait for downloads to finish with a specified timeout.

    Args
    ----
    directory : str
        The path to the folder where the files will be downloaded.
    timeout : int
        How many seconds to wait until timing out.
    nfiles : int, defaults to None
        If provided, also wait for the expected number of files.
    """
    seconds = 0
    dl_wait = True
    while dl_wait and seconds < timeout:
        time.sleep(1)
        dl_wait = False
        files = os.listdir(directory)
        if nfiles and len(files) != nfiles:
            dl_wait = True
        for fname in files:
            if fname.endswith('.crdownload'):
                dl_wait = True
        seconds += 1
    return seconds
As answered before, there is no native way to check whether a download has finished. Here is a helper function that does the job for Firefox and Chrome. One trick is to clear the temp download folder before starting a new download. It also uses the standard-library pathlib for cross-platform paths.
from pathlib import Path

def is_download_finished(temp_folder):
    firefox_temp_file = sorted(Path(temp_folder).glob('*.part'))
    chrome_temp_file = sorted(Path(temp_folder).glob('*.crdownload'))
    downloaded_files = sorted(Path(temp_folder).glob('*.*'))
    # Finished when no temp files remain and at least one file is present.
    # The parentheses let the condition span several lines.
    return (
        len(firefox_temp_file) == 0
        and len(chrome_temp_file) == 0
        and len(downloaded_files) >= 1
    )
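The folder-clearing trick mentioned in this answer is not shown in its code. A minimal sketch of that step, assuming the folder is dedicated to downloads and holds nothing you want to keep; `clear_folder` is a hypothetical helper name:

```python
from pathlib import Path


def clear_folder(temp_folder):
    """Delete every regular file directly inside `temp_folder`.

    Subdirectories are left alone. Returns the number of files removed.
    """
    removed = 0
    for entry in Path(temp_folder).iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

Calling it before each download means that any file found afterwards must belong to the new download, which is what makes the `len(downloaded_files) >= 1` check meaningful.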
I know it's late for an answer, but I'd like to share a hack for future readers.
You can create a thread, say thread1, from the main thread and start your download in it.
Then create another thread, say thread2, and have it wait for thread1 to complete using the join() method. From there you can continue your flow of execution once the download completes.
Make sure you don't initiate the download through Selenium, though; instead, extract the link with Selenium and download it with the requests module.
Download using requests module
For example:
import threading

def downloadit():
    # download code here
    pass

def after_dwn():
    dwn_thread.join()  # waits till thread1 has completed executing
    # next chunk of code after download goes here

dwn_thread = threading.Thread(target=downloadit)
dwn_thread.start()
metadata_thread = threading.Thread(target=after_dwn)
metadata_thread.start()
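If nothing else needs to run while the file downloads, the second thread is optional: the calling thread can join the download thread itself. A minimal sketch of that variation, in which `fake_download` stands in for a real `requests` call and just writes a placeholder file. All names here are hypothetical:

```python
import os
import tempfile
import threading


def fake_download(target_path):
    # Stand-in for the real download (e.g. fetching a URL with requests).
    with open(target_path, "wb") as f:
        f.write(b"placeholder content")


def download_and_wait(download, target_path, timeout=None):
    """Run `download(target_path)` in a thread and block until it finishes.

    Returns True if the thread finished within `timeout` seconds
    (or at all, when `timeout` is None).
    """
    worker = threading.Thread(target=download, args=(target_path,))
    worker.start()
    worker.join(timeout)  # returns when the thread ends or the timeout expires
    return not worker.is_alive()


target = os.path.join(tempfile.mkdtemp(), "example.bin")
finished = download_and_wait(fake_download, target)
print(finished, os.path.getsize(target))
```

Passing a `timeout` to `join()` keeps the caller from hanging forever on a stalled download; the thread itself keeps running in that case.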
This worked for me:
import os
from time import sleep

fileends = "crdownload"
while "crdownload" in fileends:
    sleep(1)
    # Assume we are done, then look for any file that is still downloading.
    fileends = "None"
    for fname in os.listdir(some_path):
        print(fname)
        if "crdownload" in fname:
            fileends = "crdownload"
Check for "Unconfirmed" key word in file name in download directory:
# wait for download complete
wait = True
while wait:
    # Keep waiting only if some file still carries the "Unconfirmed" marker.
    wait = False
    for fname in os.listdir('pathtodownload directory'):
        if 'Unconfirmed' in fname:
            wait = True
    if wait:
        print('downloading files ...')
        time.sleep(10)
print('finished downloading all files ...')
As soon as the file download is completed, it exits the while loop.
I have a better one, though:
Pass in the function that starts the download, e.g. download_function = lambda: element.click()
Then check the number of files in the directory and wait for a new file that doesn't have the download extension. After that, rename it. (This can be changed to move the file instead of renaming it in the same directory.)
import os
from time import sleep

def save_download(self, directory, download_function, new_name, timeout=30):
    """
    Download a file and rename it
    :param directory: download location that is set
    :param download_function: function to start download
    :param new_name: the name that the new download gets
    :param timeout: number of seconds to wait for download
    :return: path to downloaded file
    """
    self.logger.info("Downloading " + new_name)
    files_start = os.listdir(directory)
    download_function()
    wait = True
    i = 0
    while (wait or len(os.listdir(directory)) == len(files_start)) and i < timeout * 2:
        sleep(0.5)
        i += 1  # count half-second polls so that the timeout can trigger
        wait = False
        for file_name in os.listdir(directory):
            if file_name.endswith('.crdownload'):
                wait = True
    if i == timeout * 2:
        self.logger.warning("Documents not downloaded")
        raise TimeoutError("File not downloaded")
    else:
        self.logger.info("Downloading done")
        new_file = [name for name in os.listdir(directory) if name not in files_start][0]
        self.logger.info("New file found renaming " + new_file + " to " + new_name)
        while not os.access(os.path.join(directory, new_file), os.W_OK):
            sleep(0.5)
            self.logger.info("Waiting for write permission")
        os.rename(os.path.join(directory, new_file), os.path.join(directory, new_name))
        # Return the path under its new name; the old name no longer exists.
        return os.path.join(directory, new_name)
import os
import time

def latest_download_file(download_dir):
    # download_dir: path to the Downloads folder
    files = sorted(
        os.listdir(download_dir),
        key=lambda name: os.path.getmtime(os.path.join(download_dir, name)),
    )
    latest_f = files[-1]
    return latest_f

download_dir = 'path/to/Downloads'  # placeholder: set this to your own folder

has_crdownload = True
while has_crdownload:
    time.sleep(.6)
    newest_file = latest_download_file(download_dir)
    has_crdownload = "crdownload" in newest_file
This is a combination of a few solutions. I didn’t like having to scan the entire downloads folder for a file ending in "crdownload". This code implements a function that picks the newest file in the downloads folder, then simply checks whether that file is still being downloaded. I used it for a Selenium tool I am building, and it worked very well.
If using Selenium and Chrome, you can write a custom wait condition such as:
class file_has_been_downloaded(object):
    def __init__(self, dir, number):
        self.dir = dir
        self.number = number

    def __call__(self, driver):
        # Use self.dir here; a bare `dir` would refer to the builtin function.
        print(count_files(self.dir), '->', self.number)
        return count_files(self.dir) > self.number
The function count_files just verifies that the file has been added to the folder:
def count_files(direct):
    # os.walk yields the top-level directory first, so only that level is counted.
    for root, dirs, files in os.walk(direct):
        return len(list(f for f in files if f.startswith('MyPrefix') and (
            not f.endswith('.crdownload'))))
Then to implement this in your code:
files = count_files(dir)
# << copy the file. Possibly use shutil >>
WebDriverWait(driver, 30).until(file_has_been_downloaded(dir, files))
Create a function that uses "requests" to fetch the file content and call that instead. Your program will not move forward until the file has been downloaded.
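import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Open the website
driver.get(website_url_)
# Selenium 4 removed find_element_by_partial_link_text; use find_element with By instead
x = driver.find_element(By.PARTIAL_LINK_TEXT, 'download')
y = x.get_attribute("href")
# requests.get blocks until the whole response body has been received
fc = requests.get(y)
fname = x.text
with open(fname, 'wb') as f:
    f.write(fc.content)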
This is very simple and worked for me (and works for any extension).
import glob
import os
import time  # not strictly needed; only used for the short pauses

# count how many files you have in Downloads folder before download
user = os.getlogin()
downloads_folder = (r"C:/Users/" + user + "/Downloads/")
files_path = os.path.join(downloads_folder, '*')
files = sorted(glob.iglob(files_path), key=os.path.getctime, reverse=True)
files_before_download = files
print(f'files before download: {len(files)}')
finished = False

# ...
# code to download file
# ...

# just for extra safety
time.sleep(0.5)

# wait for the download to finish if there is +1 file in Downloads folder
while not finished:
    files = sorted(glob.iglob(files_path), key=os.path.getctime, reverse=True)
    print(len(files))
    if (len(files) == len(files_before_download)) or (len(files) == (len(files_before_download)+2)):
        print('not finished')
        finished = False
        time.sleep(0.5)  # avoid checking in a tight loop
    else:
        print('finished')
        finished = True

last_downloaded_file = files[0]
Well, what if you check the size of the file until it reaches some size x? There must be an average. (Too bored to write the code; build it yourself. Ideas help too.)
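One way to make this idea concrete without knowing the expected size is to wait until the size stops changing. This is a heuristic sketch rather than a guarantee, since a stalled download also looks stable. The name `wait_until_size_stable` and its parameters are made up for illustration:

```python
import os
import time


def wait_until_size_stable(path, stable_checks=3, poll_interval=1.0, timeout=60.0):
    """Return the file size once it is non-zero and unchanged for
    `stable_checks` consecutive polls; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # the file does not exist yet
        if size > 0 and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return size
        else:
            unchanged = 0  # size changed (or file missing): start counting again
        last_size = size
        time.sleep(poll_interval)
    raise TimeoutError(f"{path} did not reach a stable size within {timeout} s")
```

A larger `stable_checks` or `poll_interval` makes a false "finished" less likely on slow connections, at the cost of a longer wait after the download really ends.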
TL;DR
- poll for the existence of the file
- poll for non-zero filesize of that file
Observed Behaviour
I noticed that there can be a lag between a downloaded file appearing in the filesystem and the contents of that file being fully written, especially noticeable with large files.
I did some experimenting, using the stat_result from os.stat() on Linux, and found the following:
- when a file is first opened for writing
  st_size == 0
  st_atime == st_mtime == st_ctime
- while data is being written to the file
  st_size == 0
  st_atime == st_mtime == st_ctime
- once the writing is complete and the file is closed
  st_size > 0
  st_atime < st_mtime == st_ctime
Implementation
- Poll for a file using glob with a configurable timeout
  - This is useful when you don’t know exactly what the name of the downloaded file will be
- Poll for the filesize of a specific file to be above a threshold
import glob
import polling2
import os

def poll_for_file_glob(file_glob: str, step: int=1, timeout: int=20):
    try:
        polling2.poll(lambda: len(glob.glob(file_glob)), step=step, timeout=timeout)
    except polling2.TimeoutException:
        raise RuntimeError(f"Unable to find file matching glob '{file_glob}'")
    return glob.glob(file_glob)[0]

def poll_for_file_size(file_path: str, size_threshold: int=0, step: int=1, timeout: int=20):
    try:
        polling2.poll(lambda: os.stat(file_path).st_size > size_threshold, step=step, timeout=timeout)
    except polling2.TimeoutException:
        file_size = os.stat(file_path).st_size
        raise RuntimeError(f"File '{file_path}' has size {file_size}, which is not larger than threshold {size_threshold}")
    return os.stat(file_path).st_size
You might use these functions like this:
file_glob = "file_*.csv"
try:
    file_path = poll_for_file_glob(file_glob=file_glob)
    file_size = poll_for_file_size(file_path=file_path)
except RuntimeError:
    # both polling helpers raise RuntimeError on timeout
    print(f"Problem polling for file matching '{file_glob}'")
else:
    print(f"File '{file_path}' ({file_size}B) is ready")
Based on the answers:
import glob
import os

from selenium.webdriver.support.ui import WebDriverWait

# expanduser resolves '~'; joining '~/Downloads' onto os.getcwd() would not
download_dir = os.path.expanduser('~/Downloads')

def download_folder_files_cant():
    return len(glob.glob(os.path.join(download_dir, '*')))

def last_downloaded_file():
    # Not defined in the original snippet; assumed here to mean the most
    # recently modified file in the download folder.
    return max(glob.glob(os.path.join(download_dir, '*')), key=os.path.getmtime)

def check_downloaded_file(previous_cant_of_files_in_dir):
    new_file_was_created = False

    def _check_downloaded_file(driver):
        nonlocal new_file_was_created
        if previous_cant_of_files_in_dir < download_folder_files_cant():
            new_file_was_created = True
        return (
            new_file_was_created
            and not last_downloaded_file().endswith('.crdownload')
        )
    return _check_downloaded_file

def do_and_wait_download(driver, todo, timeout=15):
    previous_cant_of_files_in_dir = download_folder_files_cant()
    todo(driver)
    WebDriverWait(driver, timeout).until(
        check_downloaded_file(previous_cant_of_files_in_dir)
    )

def click_element(driver, find_by, value):
    driver.find_element(find_by, value).click()

def click_link_and_wait_download(driver, find_by, value, timeout=15):
    return do_and_wait_download(
        driver,
        lambda d: click_element(d, find_by, value),
        timeout
    )
Use as:
# If is just a click
click_link_and_wait_download(driver, By.ID, 'elementLink')
# If you extract the URL from element
url = driver.find_element(By.ID, "downloadButton").get_attribute('href')
do_and_wait_download(driver, lambda d: d.get(url))
If you know anything about the download filename (*.zip?), it is nicer to watch for future state than for crdownload. In my case I had to run this in a loop 700 times.
import glob
import time

download_glob = r'C:\Users\user\Downloads\YourDownload*.zip'

def download_wait(dir_file_glob, expected_num_files, timeout=300):
    start_time = time.time()
    while True:
        if (len(glob.glob(dir_file_glob)) - expected_num_files) >= 0:
            return
        time.sleep(5)
        if time.time() - start_time > timeout:
            raise TimeoutError('No Download Found')

n_download_files = len(glob.glob(download_glob))
# activate download here
download_wait(download_glob, n_download_files+1)
The only solution that worked for me is:
import os
import time

def get_non_temp_len(download_dir):
    non_temp_files = [i for i in os.listdir(download_dir) if not (i.endswith('.tmp') or i.endswith('.crdownload'))]
    return len(non_temp_files)

download_dir = 'your/download/dir'
original_count = get_non_temp_len(download_dir)  # get the file count at the start
# do your selenium stuff
while original_count == get_non_temp_len(download_dir):
    time.sleep(.5)  # wait for file count to change
driver.quit()
Credits to @LamerLink on this post