How to scrape zip files into a single DataFrame in Python
Question:
I am very new to web scraping and am trying to understand how to scrape all the zip files and regular files on this website. The end goal is to scrape all the data. I was originally thinking I could use pd.read_html, feed in a list of links, and loop through each zip file.
Any help at all would be very useful. I have tried a few examples so far; please see the code below.
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
This is what I would like the output to look like, except that each zip file should be its own DataFrame to work with and loop through. Currently, it only seems to return the names of the zip files, not the actual data.
Thank you
Answers:
To open a zip file and read the files inside it into a DataFrame, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"
dfs = []
# Download the archive into memory and open it without writing to disk.
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        # Each file is ";"-separated; skip its first and last lines.
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)
# Stack the per-file frames into one DataFrame.
final_df = pd.concat(dfs)
# print first 10 rows (to_markdown requires the tabulate package):
print(final_df.head(10).to_markdown(index=False))
Prints:
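0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
2017 | 1 | 1 | 1 | 58.82 | 58.82 | nan |
2017 | 1 | 1 | 2 | 58.23 | 58.23 | nan |
2017 | 1 | 1 | 3 | 51.95 | 51.95 | nan |
2017 | 1 | 1 | 4 | 47.27 | 47.27 | nan |
2017 | 1 | 1 | 5 | 46.9 | 45.49 | nan |
2017 | 1 | 1 | 6 | 46.6 | 44.5 | nan |
2017 | 1 | 1 | 7 | 46.25 | 44.5 | nan |
2017 | 1 | 1 | 8 | 46.1 | 44.72 | nan |
2017 | 1 | 1 | 9 | 46.1 | 44.22 | nan |
2017 | 1 | 1 | 10 | 45.13 | 45.13 | nan |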
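Since the question asks for each zip file to be its own DataFrame, the same reading logic can be wrapped in a function and applied to several archives. The following is only a sketch: the names zip_bytes_to_frame and frames_by_zip are hypothetical helpers, not part of pandas or requests, and it assumes every archive uses the same ";"-separated layout with one header line and one footer line as the 2017 file.

```python
from io import BytesIO
from typing import Callable, Dict
from zipfile import ZipFile

import pandas as pd


def zip_bytes_to_frame(raw: bytes) -> pd.DataFrame:
    """Concatenate every file inside one zip archive (given as bytes)."""
    frames = []
    with ZipFile(BytesIO(raw)) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                # Same parsing options as the example above.
                frames.append(
                    pd.read_csv(
                        fh,
                        sep=";",
                        skiprows=1,
                        skipfooter=1,
                        engine="python",
                        header=None,
                    )
                )
    return pd.concat(frames, ignore_index=True)


def frames_by_zip(
    urls: Dict[str, str], fetch: Callable[[str], bytes]
) -> Dict[str, pd.DataFrame]:
    """Return one DataFrame per archive, keyed by the labels in `urls`.

    `fetch` is any callable that maps a URL to the raw bytes of the zip,
    which keeps the download step separate from the parsing step.
    """
    return {label: zip_bytes_to_frame(fetch(url)) for label, url in urls.items()}
```

With requests, this could be called as frames_by_zip({"2017": zip_url}, fetch=lambda u: requests.get(u, timeout=30).content), adding one entry per archive you want to load. Each value in the returned dict is then a separate DataFrame that you can loop over.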