parallelization of downloading thousands of files using wget

Question:

I have thousands of files like those below to download.

urls = ['https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf']

I can download them one by one using wget as follows.

#wget is here, https://eternallybored.org/misc/wget/1.21.3/64/wget.exe

import os, glob, subprocess
import itertools
import multiprocessing as mp
import concurrent.futures

header = "authorizationkey"
rd = r'.\results'
for url in urls:
    app = r"C:UsersdanielDownloadswget.exe"
    subprocess.call([app, 
                     '--header', header,
                     '-P', rd, url])

However, downloading them one by one is very slow, so I wanted to use a parallelization or multithreading approach. I tried it as follows.

def doers(urls):
    for url in urls:
        app = r"C:\Users\daniel\Downloads\wget.exe"
        subprocess.call([app, 
                         '--header', header,
                         '-P', rd, url])
def batched(iterable, n):
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch

num_cpus = mp.cpu_count() - 1 

with mp.Pool(num_cpus) as pool:
    pool.map(doers, batched(urls, num_cpus))

However, my PC just hangs and does not produce any results.
Can someone help me?

Asked By: daniel


Answers:

Multithreading is probably optimal for this. You don't need wget if you have the requests module installed.

You may want to reduce the number of concurrent threads (via the ThreadPoolExecutor constructor), since you are likely to "flood" your network while streaming these fairly large files.

Something like this:

import requests
import os
from concurrent.futures import ThreadPoolExecutor
import sys


HEADER = {'Authorization': 'AuthorizationKey'}  # requests expects a dict; substitute your real authorization header value
CHUNK = 16 * 1024
TARGET = '/Volumes/G-Drive/results'

def process(url):
    try:
        with requests.get(url, headers=HEADER, stream=True) as r:
            r.raise_for_status()
            *_, filename = url.split('/')
            os.makedirs(TARGET, exist_ok=True)
            with open(os.path.join(TARGET, filename), 'wb') as hdf:
                for chunk in r.iter_content(chunk_size=CHUNK):
                    hdf.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)


def main():
    urls = [
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0450.061.2019001110251.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0455.061.2019001110452.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0500.061.2019001110658.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0535.061.2019001110116.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0555.061.2019001132709.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0615.061.2019001132734.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0630.061.2019001132950.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0635.061.2019001133203.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0650.061.2019001132727.hdf',
        'https://ladsweb.modaps.eosdis.nasa.gov//archive/allData/61/MOD03/2019/001/MOD03.A2019001.0655.061.2019001132653.hdf'
    ]
    with ThreadPoolExecutor() as tpe:
        tpe.map(process, urls)


if __name__ == '__main__':
    main()
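
If the default thread count floods your connection, you can cap it by passing max_workers to the constructor; the value 4 below is only an illustrative starting point:

with ThreadPoolExecutor(max_workers=4) as tpe:
    tpe.map(process, urls)
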
Answered By: DarkKnight

wanted to use parallelization

This is what aria2c is designed to do. After reading the aria2c man page, I suggest you use it in the following way:

aria2c.exe --header="authorizationkey" -i urls.txt

where urls.txt is a list of the URLs to download, one URL per line. If it works fast enough, fine; if not, you might try tinkering with --max-concurrent-downloads=<N>. A small Python sketch for driving it this way follows.
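
If you want to launch this from Python, here is a minimal sketch; the aria2c.exe location, output directory and concurrency value are assumptions to adapt to your setup. It writes the URL list to urls.txt and then invokes aria2c once:

import subprocess

aria2c = r"C:\Users\daniel\Downloads\aria2c.exe"  # assumed location of aria2c.exe

# one URL per line, as expected by aria2c's -i option
with open('urls.txt', 'w') as f:
    f.write('\n'.join(urls))

subprocess.call([aria2c,
                 '--header=authorizationkey',
                 '--max-concurrent-downloads=8',
                 '-d', r'.\results',
                 '-i', 'urls.txt'])
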

Answered By: Daweo

As an alternative, you can use aiohttp if you prefer an asyncio interface:

import asyncio
import aiohttp as aio


async def download(url: str, session: aio.ClientSession):
    async with session.get(url) as response:
        data = await response.read()
        # Write the data to an HDF file


async def main():
    urls = [...]

    connector = aio.TCPConnector(limit_per_host=32)

    async with aio.ClientSession(connector=connector) as session:
        await asyncio.gather(*[download(url, session) for url in urls])


if __name__ == "__main__":
    asyncio.run(main())

You can control the maximum number of concurrent requests (per host or globally) via aiohttp.TCPConnector. aiohttp caches DNS resolutions by default, and using a single ClientSession for all requests may improve overall performance.

There are also APIs to stream the received data in chunks if the files are large; a sketch follows below.
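
For example, a minimal sketch using response.content.iter_chunked (the function name, output directory and chunk size are assumptions; the file write itself is ordinary blocking I/O, which keeps the sketch simple):

import os
import aiohttp as aio


async def download_streamed(url: str, session: aio.ClientSession):
    target = 'results'  # assumed output directory
    os.makedirs(target, exist_ok=True)
    filename = url.rsplit('/', 1)[-1]
    async with session.get(url) as response:
        response.raise_for_status()
        with open(os.path.join(target, filename), 'wb') as hdf:
            # iterate over the response body in chunks instead of loading it all into memory
            async for chunk in response.content.iter_chunked(16 * 1024):
                hdf.write(chunk)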

Answered By: Louis Lac