Reading Data from URL into a Pandas Dataframe

Question:

I have a URL that I am having difficulty reading. It is unusual in that it serves data I have generated myself; in other words, the URL is built from my own query inputs. I have used something like the following with other queries and it works fine, but not in this case:

bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data',
                  skiprows=1, skipfooter=2, index_col=[0], header=None,
                  engine='python',  # c engine doesn't have skipfooter
                  delim_whitespace=True)

Here is the code + URL that is providing the challenge:

zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries',
                   skiprows=1, skipfooter=2, index_col=[0], header=None,
                   engine='python',  # c engine doesn't have skipfooter
                   delim_whitespace=True)

Thank you for any help that you can provide.

Here is the full error message:

pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
                 engine='python', # c engine doesn't have skipfooter
                 delim_whitespace=True)
Traceback (most recent call last):

  Cell In[240], line 1
    pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
    return _read(filepath_or_buffer, kwds)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
    return parser.read(nrows)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
    ) = self._engine.read(  # type: ignore[attr-defined]

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
    alldata = self._rows_to_cols(content)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
    raise ParserError(msg)

ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Asked By: user2100039


Answers:

pd.read_csv does not parse HTML. You might try pd.read_html instead, but you would find that it only works on <table> tags, not <pre> tags.
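A minimal illustration of that limitation, using made-up HTML (assumes lxml or html5lib is installed as the parser backend for read_html):

import io
import pandas as pd

# Made-up HTML: read_html only extracts <table> elements, so a
# <pre> block like the one on the NOAA page is invisible to it.
html = """
<table>
  <tr><th>year</th><th>jan</th></tr>
  <tr><td>1948</td><td>0.878</td></tr>
</table>
<pre>1948  0.878</pre>
"""

tables = pd.read_html(io.StringIO(html))  # returns a list of DataFrames
print(len(tables))  # 1 -- only the <table> was found; the <pre> was skipped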

On inspecting the HTML content of the given URL, it is evident that the data is contained in a <pre> tag.

Use something like requests to fetch the page content, and BeautifulSoup4 to parse the HTML (with an appropriate parser backend, either lxml or html5lib). Then pull out the content of the <pre> tag: split it on newlines, slice off the unwanted lines, and split each remaining line on whitespace.


Minimal working code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries'
res = requests.get(url)

# get the text from the 'pre' tag, split it on newlines
# slice off 1 head and 5 tail rows
# (inspect the contents of 'soup.find('pre').text' to determine correct values)
soup = BeautifulSoup(res.content, "html5lib")
data = soup.find('pre').text.split("\n")[1:-5]

df = pd.DataFrame([row.split() for row in data]).apply(pd.to_numeric)
df = df.set_index(df.iloc[:,0])

results in

>>> print(df.head(5))
        0      1      2      3      4      5      6      7      8      9      10     11     12
0
1948  1948  0.878  0.779  0.851  0.393  0.461  0.747  0.867  0.539 -0.106  0.045  0.819  1.506
1949  1949  0.386  1.197  1.154  1.054  0.358  0.645  0.643  0.477  0.128 -0.091  1.500  0.390
1950  1950  0.674  0.973  1.640  0.821  0.572  1.002  0.635  0.196 -0.020  0.268  0.844  1.045
1951  1951  1.524  0.698  0.971  0.790  0.789  0.587  0.682  0.238  0.256  0.035  0.906  1.268
1952  1952  1.524  1.510  1.353  0.705  0.710  1.188  0.412  0.432 -0.091  0.415  0.443  1.509

and

>>> print(df.dtypes)
0       int64
1     float64
2     float64
...
12    float64

This answer is a good starting point for what you’re trying to accomplish.
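As a variation on the same idea, the <pre> text can also be handed straight to pd.read_csv via io.StringIO, which keeps the familiar skiprows/skipfooter interface. A minimal sketch; the skiprows/skipfooter values below mirror the [1:-5] slicing above and are assumptions to verify against the raw text of your query's output:

import io
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries'
res = requests.get(url)
soup = BeautifulSoup(res.content, "html5lib")

# parse the preformatted text directly; sep=r'\s+' splits on any whitespace
# (adjust skiprows/skipfooter after inspecting soup.find('pre').text)
df = pd.read_csv(io.StringIO(soup.find('pre').text),
                 skiprows=1, skipfooter=5, header=None,
                 index_col=0, sep=r'\s+', engine='python')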

Answered By: Joshua Voskamp

It's because the first URL points directly to a dataset in storage (a .data file), while the second URL points to a web page (made up of HTML, CSS, JSON, etc. files). You can only use pd.read_csv when you point it at a delimited text file such as a .csv, and apparently a .data file too, since that worked for you.


If you can find a link to the actual .data or .csv file on that website, you will be able to parse it without a problem. Since it's a .gov website, a clean file format is probably available.


If you cannot, and you still need this data, you will have to do some web scraping from that website (for example with selenium), store the results as DataFrames, and perhaps preprocess them so they end up in the shape you expect.
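For completeness, a minimal selenium sketch (assumes selenium 4 and a local Chrome driver are installed; the [1:-5] slicing is carried over from the answer above and should be verified, since selenium trims surrounding whitespace from rendered text):

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

url = ('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?'
       'ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135'
       '&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1'
       '&Submit=Create+Timeseries')

driver = webdriver.Chrome()
try:
    driver.get(url)
    # the rendered text of the <pre> element keeps its line breaks
    lines = driver.find_element(By.TAG_NAME, "pre").text.split("\n")[1:-5]
    df = pd.DataFrame([row.split() for row in lines]).apply(pd.to_numeric)
finally:
    driver.quit()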

Answered By: Raie