Python equivalent of a given wget command
Question:
I’m trying to create a Python function that does the same thing as this wget command:
wget -c --read-timeout=5 --tries=0 "$URL"
-c
– Continue from where you left off if the download is interrupted.
--read-timeout=5
– If there is no new data coming in for over 5 seconds, give up and try again. Given -c
this mean it will try again from where it left off.
--tries=0
– Retry forever.
Those three arguments used in tandem results in a download that cannot fail.
I want to duplicate those features in my Python script, but I don’t know where to begin…
Answers:
urllib.request should work.
Just set it up in a while(not done) loop, check if a localfile already exists, if it does send a GET with a RANGE header, specifying how far you got in downloading the localfile.
Be sure to use read() to append to the localfile until an error occurs.
This is also potentially a duplicate of Python urllib2 resume download doesn't work when network reconnects
import urllib2
import time
max_attempts = 80
attempts = 0
sleeptime = 10 #in seconds, no reason to continuously try if network is down
#while true: #Possibly Dangerous
while attempts < max_attempts:
time.sleep(sleeptime)
try:
response = urllib2.urlopen("http://example.com", timeout = 5)
content = response.read()
f = open( "local/index.html", 'w' )
f.write( content )
f.close()
break
except urllib2.URLError as e:
attempts += 1
print type(e)
There is also a nice Python module named wget
that is pretty easy to use. Keep in mind that the package has not been updated since 2015 and has not implemented a number of important features, so it may be better to use other methods. It depends entirely on your use case. For simple downloading, this module is the ticket. If you need to do more, there are other solutions out there.
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532>
>> filename
'razorback.mp3'
Enjoy.
However, if wget
doesn’t work (I’ve had trouble with certain PDF files), try this solution.
Edit: You can also use the out
parameter to use a custom output directory instead of current working directory.
>>> output_directory = <directory_name>
>>> filename = wget.download(url, out=output_directory)
>>> filename
'razorback.mp3'
I had to do something like this on a version of linux that didn’t have the right options compiled into wget. This example is for downloading the memory analysis tool ‘guppy’. I’m not sure if it’s important or not, but I kept the target file’s name the same as the url target name…
Here’s what I came up with:
python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz') ; open('guppy-0.1.10.tar.gz' , 'wb').write(r.content)"
That’s the one-liner, here’s it a little more readable:
import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
This worked for downloading a tarball. I was able to extract the package and download it after downloading.
EDIT:
To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint
package, but this was tested on my machine and works fine:
#!/usr/bin/env python
from clint.textui import progress
import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
total_length = int(r.headers.get('content-length'))
for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1):
if chunk:
f.write(chunk)
f.flush()
Let me Improve a example with threads in case you want download many files.
import math
import random
import threading
import requests
from clint.textui import progress
# You must define a proxy list
# I suggests https://free-proxy-list.net/
proxies = {
0: {'http': 'http://34.208.47.183:80'},
1: {'http': 'http://40.69.191.149:3128'},
2: {'http': 'http://104.154.205.214:1080'},
3: {'http': 'http://52.11.190.64:3128'}
}
# you must define the list for files do you want download
videos = [
"https://i.stack.imgur.com/g2BHi.jpg",
"https://i.stack.imgur.com/NURaP.jpg"
]
downloaderses = list()
def downloaders(video, selected_proxy):
print("Downloading file named {} by proxy {}...".format(video, selected_proxy))
r = requests.get(video, stream=True, proxies=selected_proxy)
nombre_video = video.split("/")[3]
with open(nombre_video, 'wb') as f:
total_length = int(r.headers.get('content-length'))
for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
if chunk:
f.write(chunk)
f.flush()
for video in videos:
selected_proxy = proxies[math.floor(random.random() * len(proxies))]
t = threading.Thread(target=downloaders, args=(video, selected_proxy))
downloaderses.append(t)
for _downloaders in downloaderses:
_downloaders.start()
easy as py:
class Downloder():
def download_manager(self, url, destination='Files/DownloderApp/', try_number="10", time_out="60"):
#threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out, log_file)).start()
if self._wget_dl(url, destination, try_number, time_out, log_file) == 0:
return True
else:
return False
def _wget_dl(self,url, destination, try_number, time_out):
import subprocess
command=["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out , url]
try:
download_state=subprocess.call(command)
except Exception as e:
print(e)
#if download_state==0 => successfull download
return download_state
A solution that I often find simpler and more robust is to simply execute a terminal command within python. In your case:
import os
url = 'https://www.someurl.com'
os.system(f"""wget -c --read-timeout=5 --tries=0 "{url}"""")
TensorFlow makes life easier. file path gives us the location of downloaded file.
import tensorflow as tf
tf.keras.utils.get_file(origin='https://storage.googleapis.com/tf-datasets/titanic/train.csv',
fname='train.csv',
untar=False, extract=False)
For Windows and Python 3.x, my two cents contribution about renaming the file on download :
- Install wget module :
pip install wget
- Use wget :
import wget
wget.download('Url', 'C:\PathToMyDownloadFolder\NewFileName.extension')
Truely working command line example :
python -c "import wget; wget.download(""https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz"", ""C:\Users\TestName.TestExtension"")"
Note : ‘C:\PathToMyDownloadFolder\NewFileName.extension’ is not mandatory. By default, the file is not renamed, and the download folder is your local path.
Here’s the code adopted from the torchvision library:
import urllib
def download_url(url, root, filename=None):
"""Download a file from a url and place it in root.
Args:
url (str): URL to download file from
root (str): Directory to place downloaded file in
filename (str, optional): Name to save the file under. If None, use the basename of the URL
"""
root = os.path.expanduser(root)
if not filename:
filename = os.path.basename(url)
fpath = os.path.join(root, filename)
os.makedirs(root, exist_ok=True)
try:
print('Downloading ' + url + ' to ' + fpath)
urllib.request.urlretrieve(url, fpath)
except (urllib.error.URLError, IOError) as e:
if url[:5] == 'https':
url = url.replace('https:', 'http:')
print('Failed download. Trying https -> http instead.'
' Downloading ' + url + ' to ' + fpath)
urllib.request.urlretrieve(url, fpath)
If you are ok to take dependency on torchvision library then you also also simply do:
from torchvision.datasets.utils import download_url
download_url('http://something.com/file.zip', '~/my_folder`)
I’m trying to create a Python function that does the same thing as this wget command:
wget -c --read-timeout=5 --tries=0 "$URL"
-c
– Continue from where you left off if the download is interrupted.
--read-timeout=5
– If there is no new data coming in for over 5 seconds, give up and try again. Given -c
this mean it will try again from where it left off.
--tries=0
– Retry forever.
Those three arguments used in tandem results in a download that cannot fail.
I want to duplicate those features in my Python script, but I don’t know where to begin…
urllib.request should work.
Just set it up in a while(not done) loop, check if a localfile already exists, if it does send a GET with a RANGE header, specifying how far you got in downloading the localfile.
Be sure to use read() to append to the localfile until an error occurs.
This is also potentially a duplicate of Python urllib2 resume download doesn't work when network reconnects
import urllib2
import time
max_attempts = 80
attempts = 0
sleeptime = 10 #in seconds, no reason to continuously try if network is down
#while true: #Possibly Dangerous
while attempts < max_attempts:
time.sleep(sleeptime)
try:
response = urllib2.urlopen("http://example.com", timeout = 5)
content = response.read()
f = open( "local/index.html", 'w' )
f.write( content )
f.close()
break
except urllib2.URLError as e:
attempts += 1
print type(e)
There is also a nice Python module named wget
that is pretty easy to use. Keep in mind that the package has not been updated since 2015 and has not implemented a number of important features, so it may be better to use other methods. It depends entirely on your use case. For simple downloading, this module is the ticket. If you need to do more, there are other solutions out there.
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532>
>> filename
'razorback.mp3'
Enjoy.
However, if wget
doesn’t work (I’ve had trouble with certain PDF files), try this solution.
Edit: You can also use the out
parameter to use a custom output directory instead of current working directory.
>>> output_directory = <directory_name>
>>> filename = wget.download(url, out=output_directory)
>>> filename
'razorback.mp3'
I had to do something like this on a version of linux that didn’t have the right options compiled into wget. This example is for downloading the memory analysis tool ‘guppy’. I’m not sure if it’s important or not, but I kept the target file’s name the same as the url target name…
Here’s what I came up with:
python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz') ; open('guppy-0.1.10.tar.gz' , 'wb').write(r.content)"
That’s the one-liner, here’s it a little more readable:
import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
This worked for downloading a tarball. I was able to extract the package and download it after downloading.
EDIT:
To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint
package, but this was tested on my machine and works fine:
#!/usr/bin/env python
from clint.textui import progress
import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
total_length = int(r.headers.get('content-length'))
for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1):
if chunk:
f.write(chunk)
f.flush()
Let me Improve a example with threads in case you want download many files.
import math
import random
import threading
import requests
from clint.textui import progress
# You must define a proxy list
# I suggests https://free-proxy-list.net/
proxies = {
0: {'http': 'http://34.208.47.183:80'},
1: {'http': 'http://40.69.191.149:3128'},
2: {'http': 'http://104.154.205.214:1080'},
3: {'http': 'http://52.11.190.64:3128'}
}
# you must define the list for files do you want download
videos = [
"https://i.stack.imgur.com/g2BHi.jpg",
"https://i.stack.imgur.com/NURaP.jpg"
]
downloaderses = list()
def downloaders(video, selected_proxy):
print("Downloading file named {} by proxy {}...".format(video, selected_proxy))
r = requests.get(video, stream=True, proxies=selected_proxy)
nombre_video = video.split("/")[3]
with open(nombre_video, 'wb') as f:
total_length = int(r.headers.get('content-length'))
for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
if chunk:
f.write(chunk)
f.flush()
for video in videos:
selected_proxy = proxies[math.floor(random.random() * len(proxies))]
t = threading.Thread(target=downloaders, args=(video, selected_proxy))
downloaderses.append(t)
for _downloaders in downloaderses:
_downloaders.start()
easy as py:
class Downloder():
def download_manager(self, url, destination='Files/DownloderApp/', try_number="10", time_out="60"):
#threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out, log_file)).start()
if self._wget_dl(url, destination, try_number, time_out, log_file) == 0:
return True
else:
return False
def _wget_dl(self,url, destination, try_number, time_out):
import subprocess
command=["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out , url]
try:
download_state=subprocess.call(command)
except Exception as e:
print(e)
#if download_state==0 => successfull download
return download_state
A solution that I often find simpler and more robust is to simply execute a terminal command within python. In your case:
import os
url = 'https://www.someurl.com'
os.system(f"""wget -c --read-timeout=5 --tries=0 "{url}"""")
TensorFlow makes life easier. file path gives us the location of downloaded file.
import tensorflow as tf
tf.keras.utils.get_file(origin='https://storage.googleapis.com/tf-datasets/titanic/train.csv',
fname='train.csv',
untar=False, extract=False)
For Windows and Python 3.x, my two cents contribution about renaming the file on download :
- Install wget module :
pip install wget
- Use wget :
import wget
wget.download('Url', 'C:\PathToMyDownloadFolder\NewFileName.extension')
Truely working command line example :
python -c "import wget; wget.download(""https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz"", ""C:\Users\TestName.TestExtension"")"
Note : ‘C:\PathToMyDownloadFolder\NewFileName.extension’ is not mandatory. By default, the file is not renamed, and the download folder is your local path.
Here’s the code adopted from the torchvision library:
import urllib
def download_url(url, root, filename=None):
"""Download a file from a url and place it in root.
Args:
url (str): URL to download file from
root (str): Directory to place downloaded file in
filename (str, optional): Name to save the file under. If None, use the basename of the URL
"""
root = os.path.expanduser(root)
if not filename:
filename = os.path.basename(url)
fpath = os.path.join(root, filename)
os.makedirs(root, exist_ok=True)
try:
print('Downloading ' + url + ' to ' + fpath)
urllib.request.urlretrieve(url, fpath)
except (urllib.error.URLError, IOError) as e:
if url[:5] == 'https':
url = url.replace('https:', 'http:')
print('Failed download. Trying https -> http instead.'
' Downloading ' + url + ' to ' + fpath)
urllib.request.urlretrieve(url, fpath)
If you are ok to take dependency on torchvision library then you also also simply do:
from torchvision.datasets.utils import download_url
download_url('http://something.com/file.zip', '~/my_folder`)