Reading Data from URL into a Pandas Dataframe
Question:
I have a URL that I am having difficulty reading. It is unusual in the sense that the data is self-generated: it was created from my own inputs via the site's form. I have used a similar call successfully with other queries, but it does not work in this case:
bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data', skiprows=1,
skipfooter=2,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Here is the code + URL that is providing the challenge:
zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries',
                   skiprows=1, skipfooter=2, index_col=[0], header=None,
                   engine='python', # c engine doesn't have skipfooter
                   delim_whitespace=True)
Thank you for any help that you can provide.
Here is the full error message:
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Traceback (most recent call last):
Cell In[240], line 1
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
    return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
    return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
    return _read(filepath_or_buffer, kwds)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
    return parser.read(nrows)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
    ) = self._engine.read(  # type: ignore[attr-defined]
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
    alldata = self._rows_to_cols(content)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
    raise ParserError(msg)
ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Answers:
pd.read_csv does not parse HTML. You might try pd.read_html, but you would find that it works on <table> tags, not <pre> tags.
On inspecting the HTML content of the given URL, it is evident that the data is contained in a <pre> tag.
Use something like requests to get the page content, and BeautifulSoup4 to parse the HTML page contents (with an appropriate parsing engine, either lxml or html5lib). Then pull out the content of the <pre> tag, splitting on newlines, slicing to ignore unwanted lines, and then splitting each row on whitespace.
Minimal working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries'
res = requests.get(url)
# get the text from the 'pre' tag, split it on newlines
# slice off 1 head and 5 tail rows
# (inspect the contents of 'soup.find('pre').text' to determine correct values)
soup = BeautifulSoup(res.content, "html5lib")
data = soup.find('pre').text.split("\n")[1:-5]
df = pd.DataFrame([row.split() for row in data]).apply(pd.to_numeric)
df = df.set_index(df.iloc[:,0])
results in
>>> print(df.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0
1948 1948 0.878 0.779 0.851 0.393 0.461 0.747 0.867 0.539 -0.106 0.045 0.819 1.506
1949 1949 0.386 1.197 1.154 1.054 0.358 0.645 0.643 0.477 0.128 -0.091 1.500 0.390
1950 1950 0.674 0.973 1.640 0.821 0.572 1.002 0.635 0.196 -0.020 0.268 0.844 1.045
1951 1951 1.524 0.698 0.971 0.790 0.789 0.587 0.682 0.238 0.256 0.035 0.906 1.268
1952 1952 1.524 1.510 1.353 0.705 0.710 1.188 0.412 0.432 -0.091 0.415 0.443 1.509
and
>>> print(df.dtypes)
0 int64
1 float64
2 float64
...
12 float64
This answer is a good starting point for what you’re trying to accomplish.
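A variant that stays closer to the original read_csv call: once the <pre> text has been extracted, it can be handed straight to pd.read_csv through io.StringIO, reusing the skiprows/skipfooter idea from the question. The sketch below uses a small hard-coded sample standing in for the live page content; the actual header and footer line counts must be checked against the real <pre> text, and sep=r'\s+' is used because delim_whitespace is deprecated in recent pandas.

```python
import io

import pandas as pd

# Hard-coded sample mimicking the structure of the <pre> block:
# a year-range header line, whitespace-separated data rows,
# then a missing-value flag and a label line as the footer.
pre_text = (
    "1948 2023\n"
    "1948  0.878  0.779  0.851\n"
    "1949  0.386  1.197  1.154\n"
    "1950  0.674  0.973  1.640\n"
    "-999.0\n"
    "Zonal Wind\n"
)

df = pd.read_csv(
    io.StringIO(pre_text),
    skiprows=1,       # drop the "1948 2023" year-range header
    skipfooter=2,     # drop the two trailing footer lines
    header=None,
    index_col=0,      # use the year column as the index
    engine='python',  # c engine doesn't have skipfooter
    sep=r'\s+',       # whitespace-delimited columns
)
print(df)
```

In the real case, pre_text would be soup.find('pre').text from the answer above, and the skiprows/skipfooter values would be tuned by inspecting that text.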
It's because the first URL points directly to a dataset stored in .data format, while the second URL points to a web page (which is made up of HTML, CSS, JSON, etc. files). You can only use pd.read_csv when parsing a delimited text file such as a .csv, or evidently a .data file too, since that worked for you.
If you can find a link to the actual .data or .csv file on that website, you will be able to parse it without a problem. Since it's a .gov website, they will probably offer the data in a good file format.
If you cannot, and you still need this data, you will have to do some web scraping on that website (for example with selenium), store the results as DataFrames, and maybe preprocess them so they load as expected.