Read and tabulate table-like data from website

Question:

I want to tabulate and store into Pandas this linked data from the U.S. Weather Service.

Here are the first four lines of the webpage.

Historic Crests

(1) 43.28 ft on 04/17/1979

(2) 39.58 ft on 05/25/1983

(3) 36.67 ft on 02/17/2020

You can access the data from an IDE or notebook using the following code.

import bs4
import urllib.request
link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"

webpage=str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage)

print(soup.get_text())

The data is already in a table-like structure, and I think you could tabulate it by parsing it into a dictionary of lists and then uploading it into a Pandas Dataframe. However, I imagine that there is a more simple pythonic approach.

Here’s a snippet of the desired table structure.

No. Crest Date
1 43.28 04/17/1979
2 39.58 05/25/1983
3 36.67 02/17/2020
Asked By: Juancheeto

||

Answers:

You could create a regex pattern and feed it to Series.str.extractall. E.g.:

tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'n')

df = pd.Series(vals).str.extractall(r'((?P<No>d+))s(?P<Crest>d+.d+)sftsons(?P<Date>d{2}/d{2}/d{4})')
    .reset_index(drop=True)

print(df)

    No  Crest        Date
0    1  43.28  04/17/1979
1    2  39.58  05/25/1983
2    3  36.67  02/17/2020
3    4  36.30  03/31/1902
4    5  36.30  12/05/1880
..  ..    ...         ...
86  87  30.30  04/20/1955
87  88  30.26  01/25/1983
88  89  30.26  04/07/1981
89  90  30.25  01/24/2020
90  91  30.23  12/23/1967

And maybe change the dtypes at this point:

df['No'] = df.No.astype(int)
df['Crest'] = df.Crest.astype(float)
df['Date'] = pd.to_datetime(df.Date)
Answered By: ouroboros1
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.