How to scrape zip files into a single dataframe in Python

Question:

I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files that are on this website. The end goal is to scrape all the data. I was originally thinking I could use pd.read_html, feed in a list of links, and loop through each zip file.

Any help at all would be very useful. I have tried a few examples so far; please see the code below:

import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")

This is roughly what I would like the output to look like, except that each zip file would need to be its own data frame to work with or loop through. Currently, all it seems to do is return the names of the zip files, not the actual data.

Thank you

Asked By: ARE


Answers:

To open a zip file and read the files inside it into a dataframe, you can use the following example:

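import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"

# download the archive and fail early on an HTTP error
response = requests.get(zip_url, timeout=60)
response.raise_for_status()

dfs = []
# open the archive from memory, no temporary file needed
with ZipFile(BytesIO(response.content)) as zf:
    for file in zf.namelist():
        # skip the first and last line of each file (skipfooter needs engine="python")
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)

# one dataframe for the whole archive, with a fresh 0..n-1 index
final_df = pd.concat(dfs, ignore_index=True)

# print first 10 rows (to_markdown requires the tabulate package):
print(final_df.head(10).to_markdown(index=False))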

Prints:

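|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |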
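
If you want each zip file as its own dataframe, as the question asks, the same idea can be wrapped in a function and called once per archive. The following is only a sketch: the names read_zip_bytes, load_years and ZIP_URL are made up for illustration, and the {year} placeholder assumes that the other archives follow the same marginalpdbc_YYYY.zip naming as the 2017 link above, which I have not checked for every year.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd
import requests

# URL copied from the example above; the {year} placeholder is an assumption
# that other years use the same file naming scheme.
ZIP_URL = (
    "https://www.omie.es/es/file-download"
    "?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_{year}.zip"
)


def read_zip_bytes(content: bytes) -> pd.DataFrame:
    """Parse every file in a zip archive (given as bytes) into one dataframe."""
    frames = []
    with ZipFile(BytesIO(content)) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                frames.append(
                    pd.read_csv(
                        fh,
                        sep=";",
                        skiprows=1,       # first line is a title, not data
                        skipfooter=1,     # last line is an end marker
                        engine="python",  # required for skipfooter
                        header=None,
                    )
                )
    return pd.concat(frames, ignore_index=True)


def load_years(years) -> dict:
    """Download one archive per year and return {year: dataframe}."""
    frames_by_year = {}
    for year in years:
        response = requests.get(ZIP_URL.format(year=year), timeout=60)
        response.raise_for_status()
        frames_by_year[year] = read_zip_bytes(response.content)
    return frames_by_year
```

Calling load_years([2017, 2018]) would then give a dict with one dataframe per zip file, and pd.concat(frames_by_year.values(), ignore_index=True) combines them into a single dataframe when you need that instead.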
Answered By: Andrej Kesely