Download files from a public S3 bucket

Question:

I’m trying to download some files from a public S3 bucket as part of the Google Data Analytics course. However, the links are not being returned in my request. I’m not sure whether I need to use boto3 or a different package, since it’s a public URL with visible links. Reading the Boto3 docs, I am not 100% sure how I would list the zip files that appear as links on the page. Sorry, I’m fairly new at this.

So far, this is what I’ve gotten:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')  # specify a parser explicitly
    
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

The request to the URL returns a 200; however, the list of hrefs collected from the ‘a’ tags comes up empty. I am trying to get all of the hrefs so I can loop over them and download each zip file with urllib.request, using the base URL plus /filename for each zip file.
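For reference, the download loop I have in mind would look something like this (the file names here are hardcoded just for illustration, since I can’t extract them from the page yet):

```python
import os
import urllib.request

BASE_URL = 'https://divvy-tripdata.s3.amazonaws.com'

def build_url(filename):
    # join the bucket's base URL and a zip file name
    return f'{BASE_URL}/{filename}'

def download_zips(filenames, dest_dir='.'):
    # download each zip file into dest_dir
    os.makedirs(dest_dir, exist_ok=True)
    for name in filenames:
        urllib.request.urlretrieve(build_url(name), os.path.join(dest_dir, name))
```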

Any help would be greatly appreciated and thank you in advance!

Asked By: Mister Dabu


Answers:

It would appear that your goal is to download files from a public Amazon S3 bucket.

The easiest approach is to use the AWS Command-Line Interface (CLI). Since the bucket is public, you do not require any credentials:

aws s3 --no-sign-request sync s3://divvy-tripdata .
download: s3://divvy-tripdata/202006-divvy-tripdata.zip to ./202006-divvy-tripdata.zip
download: s3://divvy-tripdata/202012-divvy-tripdata.zip to ./202012-divvy-tripdata.zip
download: s3://divvy-tripdata/202007-divvy-tripdata.zip to ./202007-divvy-tripdata.zip
download: s3://divvy-tripdata/202010-divvy-tripdata.zip to ./202010-divvy-tripdata.zip
download: s3://divvy-tripdata/202011-divvy-tripdata.zip to ./202011-divvy-tripdata.zip
etc
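As an aside on why the scraping attempt came up empty: the bucket’s index.html appears to build its link list with JavaScript after the page loads, so the raw HTML that requests fetches contains no ‘a’ tags. If you’d rather stay in plain Python without the CLI, a public bucket’s REST endpoint returns its listing as XML, which can be parsed directly. A minimal sketch (the XML namespace below is the standard S3 one):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by S3's ListBucketResult documents
S3_NS = {'s3': 'http://s3.amazonaws.com/doc/2006-03-01/'}

def parse_keys(xml_text):
    """Extract object keys from a ListBucketResult XML document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall('.//s3:Key', S3_NS)]

def list_bucket_keys(bucket_url='https://divvy-tripdata.s3.amazonaws.com/'):
    """Fetch a public bucket's XML listing; no credentials required."""
    with urllib.request.urlopen(bucket_url) as resp:
        return parse_keys(resp.read().decode('utf-8'))
```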
Answered By: John Rotenstein

This is the solution for those working through the Google Data Analytics Case Study 1 files at divvy-tripdata.s3.amazonaws.com/index.html who lack programming expertise but don’t want to download the case files one by one.
First, install the AWS Command Line Interface from the terminal (assuming you use a Mac). I spent hours trying to download via Python and BeautifulSoup and failed, so this is the easy alternative:

In the terminal, run this:

sudo easy_install awscli

or (which worked better for me)

sudo pip install awscli

Either of the above installs the command-line interface, and then a single command downloads all the zip files into the current folder on your hard drive.

Run in the terminal

aws s3 --no-sign-request sync s3://divvy-tripdata .

You can play with the destination folder, of course.

You should see something like this in the terminal as a result:

download: s3://divvy-tripdata/202004-divvy-tripdata.zip to ./202004-divvy-tripdata.zip
download: s3://divvy-tripdata/202005-divvy-tripdata.zip to ./202005-divvy-tripdata.zip
download: s3://divvy-tripdata/202007-divvy-tripdata.zip to ./202007-divvy-tripdata.zip
download: s3://divvy-tripdata/202006-divvy-tripdata.zip to ./202006-divvy-tripdata.zip
download: s3://divvy-tripdata/202011-divvy-tripdata.zip to ./202011-divvy-tripdata.zip
download: s3://divvy-tripdata/202102-divvy-tripdata.zip to ./202102-divvy-tripdata.zip
download: s3://divvy-tripdata/202009-divvy-tripdata.zip to ./202009-divvy-tripdata.zip
Answered By: Maksim Dementev

Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for ease of future access. As such, I was able to figure out how to download the zip files using boto3.

This solution uses botocore, the lower-level library underneath boto3, to bypass authentication with the UNSIGNED signature config. I found out about this through a GitHub project called s3-key-listener, which will "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".

# Install boto3 (this includes botocore)
!pip install boto3

import os  # for joining the download directory path
import boto3
from botocore import UNSIGNED
from botocore.client import Config

def get_s3_public_data(bucket='divvy-tripdata'):
    # create the s3 client with the UNSIGNED config
    # (no credentials needed for a public bucket)
    client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

    # create a list of 'Contents' objects from the s3 bucket
    list_files = client.list_objects(Bucket=bucket)['Contents']

    os.makedirs('./data', exist_ok=True)  # make sure the target folder exists
    for key in list_files:
        if key['Key'].endswith('.zip'):
            print(f'downloading... {key["Key"]}')  # print file name
            client.download_file(
                Bucket=bucket,                                # bucket name
                Key=key['Key'],                               # key is the file name
                Filename=os.path.join('./data', key['Key']),  # local file path
            )
        # if it's not a zip file, do nothing

get_s3_public_data()

This connects to the s3 bucket and fetches the zip files for me. Hope this helps anyone else dealing with a similar issue.

Answered By: Mister Dabu