parallelization of downloading thousands of files using wget
Question:
I have thousands of the files like below to be downloaded.
urls = ['https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf']
I can download them one by one using wget as follows.
#wget is here, https://eternallybored.org/misc/wget/1.21.3/64/wget.exe
import os, glob, subprocess
import itertools
import multiprocessing as mp
import concurrent.futures
header = "authorizationkey"
rd = '.\results\'
for url in urls:
app = r"C:UsersdanielDownloadswget.exe"
subprocess.call([app,
'--header', header,
'-P', rd, url])
However, downloading one by one is very slow. So, I wanted to use parallelization or multithreading approach. I tried it as follows.
def doers(urls):
for url in urls:
app = r"C:UsersdanielDownloadswget.exe"
subprocess.call([app,
'--header', authorization,
'-P', rd, url])
def batched(iterable, n):
it = iter(iterable)
while (batch := tuple(itertools.islice(it, n))):
yield batch
num_cpus = mp.cpu_count() - 1
with mp.Pool(num_cpus) as pool:
pool.map(doers, batched(urls, num_cpus))
However, my pc just hangs and does not produce any results.
Can someone help me ?
Answers:
Multithreading is probably optimum for this. You don’t need to use wget if you have the requests module installed.
You may want to consider reducing the number of concurrent threads (ThreadPoolExecutor constructor) as you’re likely to "flood" your network as you stream these fairly large files.
Something like this:
import requests
import os
from concurrent.futures import ThreadPoolExecutor
import sys
HEADER = {AuthorizationKey}
CHUNK = 16 * 1024
TARGET = '/Volumes/G-Drive/results'
def process(url):
try:
with requests.get(url, headers=HEADER, stream=True) as r:
r.raise_for_status()
*_, filename = url.split('/')
os.makedirs(TARGET, exist_ok=True)
with open(os.path.join(TARGET, filename), 'wb') as hdf:
for chunk in r.iter_content(chunk_size=CHUNK):
hdf.write(chunk)
except Exception as e:
print(e, file=sys.stderr)
def main():
urls = [
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf'
]
with ThreadPoolExecutor() as tpe:
tpe.map(process, urls)
if __name__ == '__main__':
main()
wanted to use parallelization
This is what aria2c
aims to do. After reading aria2c
man page I suggest you use it following way
aria2c.exe --header="authorizationkey" -i urls.txt
where urls.txt
is list of URLs to download, 1 URL per line. If it will work fast enough then fine, if not you might try tinkering with --max-concurrent-downloads=<N>
As an alternative, you can use AIOHTTP if you prefer an asyncIO interface:
import asyncio
import aiohttp as aio
async def download(url: str, session: aio.ClientSession):
async with session.get(url) as response:
data = await response.read()
# Write the data to an HDF file
async def main():
urls = [...]
connector = aio.TCPConnector(limit_per_host=32)
async with aio.ClientSession(connector=connector) as session:
await asyncio.gather(*[download(url, session) for url in urls])
if __name__ == "__main__":
asyncio.run(main())
You can control the maximum number of concurrent requests (per host or global) via the aiohttp.TCPConnector
. AIOHTTP caches DNS resolutions by default and using a single ClientSession
for all the requests may improve the overall performance.
There are also API’s to stream the received data in chunks if the data is big.
I have thousands of the files like below to be downloaded.
urls = ['https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf', 'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf']
I can download them one by one using wget as follows.
#wget is here, https://eternallybored.org/misc/wget/1.21.3/64/wget.exe
import os, glob, subprocess
import itertools
import multiprocessing as mp
import concurrent.futures
header = "authorizationkey"
rd = '.\results\'
for url in urls:
app = r"C:UsersdanielDownloadswget.exe"
subprocess.call([app,
'--header', header,
'-P', rd, url])
However, downloading one by one is very slow. So, I wanted to use parallelization or multithreading approach. I tried it as follows.
def doers(urls):
for url in urls:
app = r"C:UsersdanielDownloadswget.exe"
subprocess.call([app,
'--header', authorization,
'-P', rd, url])
def batched(iterable, n):
it = iter(iterable)
while (batch := tuple(itertools.islice(it, n))):
yield batch
num_cpus = mp.cpu_count() - 1
with mp.Pool(num_cpus) as pool:
pool.map(doers, batched(urls, num_cpus))
However, my pc just hangs and does not produce any results.
Can someone help me ?
Multithreading is probably optimum for this. You don’t need to use wget if you have the requests module installed.
You may want to consider reducing the number of concurrent threads (ThreadPoolExecutor constructor) as you’re likely to "flood" your network as you stream these fairly large files.
Something like this:
import requests
import os
from concurrent.futures import ThreadPoolExecutor
import sys
HEADER = {AuthorizationKey}
CHUNK = 16 * 1024
TARGET = '/Volumes/G-Drive/results'
def process(url):
try:
with requests.get(url, headers=HEADER, stream=True) as r:
r.raise_for_status()
*_, filename = url.split('/')
os.makedirs(TARGET, exist_ok=True)
with open(os.path.join(TARGET, filename), 'wb') as hdf:
for chunk in r.iter_content(chunk_size=CHUNK):
hdf.write(chunk)
except Exception as e:
print(e, file=sys.stderr)
def main():
urls = [
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf',
'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf'
]
with ThreadPoolExecutor() as tpe:
tpe.map(process, urls)
if __name__ == '__main__':
main()
wanted to use parallelization
This is what aria2c
aims to do. After reading aria2c
man page I suggest you use it following way
aria2c.exe --header="authorizationkey" -i urls.txt
where urls.txt
is list of URLs to download, 1 URL per line. If it will work fast enough then fine, if not you might try tinkering with --max-concurrent-downloads=<N>
As an alternative, you can use AIOHTTP if you prefer an asyncIO interface:
import asyncio
import aiohttp as aio
async def download(url: str, session: aio.ClientSession):
async with session.get(url) as response:
data = await response.read()
# Write the data to an HDF file
async def main():
urls = [...]
connector = aio.TCPConnector(limit_per_host=32)
async with aio.ClientSession(connector=connector) as session:
await asyncio.gather(*[download(url, session) for url in urls])
if __name__ == "__main__":
asyncio.run(main())
You can control the maximum number of concurrent requests (per host or global) via the aiohttp.TCPConnector
. AIOHTTP caches DNS resolutions by default and using a single ClientSession
for all the requests may improve the overall performance.
There are also API’s to stream the received data in chunks if the data is big.