Tbody data showing "None" when scraped?
Question:
Last time I ran into this issue, adding the header info fixed it, but that doesn't seem to be the case here. I'm trying different methods, but ultimately my goal is to scrape the info from all of the tables on each of the links listed.
The data comes up inside a tbody, specifically under the class table-responsive.xs (I think).
I've tried grabbing all of the tbody data, and also just this class, but I'm getting nothing except a list of None values.
Is there another approach? I hoped adding the header was the solution, but apparently not.
from requests_html import HTMLSession
from bs4 import BeautifulSoup

profiles = []
session = HTMLSession()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-City-Surfing/279/'
]

for url in urls:
    r = session.get(url, headers=headers)
    # wait 3 s for the page to fully load
    r.html.render(sleep=3, timeout=20)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('div', attrs={"class": "table-responsive.xs"}):
        profiles.append(profile)

for p in profiles:
    print(p)
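A likely reason that class lookup comes back empty (a hedged sketch with made-up HTML, since I can't be sure of the page's real markup): in CSS-selector notation, `table-responsive.xs` means an element carrying *two* classes, `table-responsive` and `xs` — the dot is selector syntax, not part of a class name. BeautifulSoup's `attrs={"class": ...}` compares against the actual class names, so the literal string with the dot never matches; `select()` understands the selector form:

```python
from bs4 import BeautifulSoup

# hypothetical markup with the two separate classes
html = '<div class="table-responsive xs"><table><tr><td>ok</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

# The dotted string matches no class name, so this finds nothing:
print(soup.find_all('div', attrs={"class": "table-responsive.xs"}))  # []

# A CSS selector treats the dot as "and also has class", so this matches:
print(soup.select('div.table-responsive.xs'))
```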
Also tried:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

profiles = []
session = HTMLSession()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-City-Surfing/279/'
]

for url in urls:
    r = session.get(url, headers=headers)
    # wait 3 s for the page to fully load
    r.html.render(sleep=3, timeout=20)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)

for p in profiles:
    print(p)
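The list of Nones from this second attempt has a mechanical explanation: `Tag.get()` looks up an HTML *attribute* on the tag itself, so `a_tag.get('tbody')` is None for every `<a>` — `tbody` is a child element, not an attribute. A minimal illustration with toy markup:

```python
from bs4 import BeautifulSoup

html = '<table><tbody><tr><td>1.2m</td></tr></tbody></table><a href="/spot/1">link</a>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find('a')
print(a.get('tbody'))   # None -- 'tbody' is not an attribute of <a>
print(a.get('href'))    # /spot/1 -- href IS an attribute, so get() works

# To reach the table rows, search for the elements instead:
for row in soup.select('tbody tr'):
    print([td.get_text() for td in row.find_all('td')])
```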
Lastly: with someone's great guidance here, I am separately able to pull the full JSON data with the script below:
import requests
import pandas as pd

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
# keep the DataFrame in its own variable; to_csv() returns None
df = pd.DataFrame(r.json())
df.to_csv('out.csv', index=False)

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
print(df)
However, since I live in NJ, I only really care about the NJ waves. I used an href scrape to get the URLs I'd like to see data for. Ideally I could pull a week's worth of info, but if a single day is the only option, I'll survive.
I tried an if statement that focuses only on specific URLs (they are in the JSON data), but I'm not having any luck. Ultimately I want to add an OR to include all of the links listed, unless someone has a better idea?
I know I could easily match them once extracted, but I don't want to run 9,000 rows every time when I only need a select few.
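One way to avoid scanning all 9,000 rows by hand is to let pandas do the membership test. This is a hedged sketch on made-up rows: I'm assuming each spot record in the API payload has a `url` field holding the site-relative path (the real field names may differ), and `wanted` is a hypothetical set of the NJ paths:

```python
import pandas as pd

# Hypothetical rows mimicking the API's assumed shape
spots = [
    {"_id": 3683, "url": "/Belmar-Surf-Report/3683/"},
    {"_id": 386,  "url": "/Manasquan-Surf-Report/386/"},
    {"_id": 9999, "url": "/Some-Other-Spot-Surf-Report/9999/"},
]
wanted = {
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
}

df = pd.DataFrame(spots)
nj = df[df["url"].isin(wanted)]   # keep only the listed NJ spots
print(nj)
```

`Series.isin()` replaces the chain of ORs: one set lookup per row instead of a hand-written condition per link.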
import requests
import pandas as pd

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json()).to_csv('out.csv', index=False)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

for d in df:
    if d and '/Belmar-Surf-Report/3683' in df:
        print(d)
# '/Belmar-Surf-Report/3683'
# '/Manasquan-Surf-Report/386/'
# '/Ocean-Grove-Surf-Report/7945/'
# '/Asbury-Park-Surf-Report/857/'
# '/Avon-Surf-Report/4050/'
# '/Bay-Head-Surf-Report/4951/'
# '/Belmar-Surf-Report/3683/'
# '/Boardwalk-Surf-Report/9183/'
# '/Bradley-Beach-Surf-Report/7944/'
# '/Casino-Surf-Report/9175/'
# '/Deal-Surf-Report/822/'
# '/Dog-Park-Surf-Report/9174/'
# '/Jenkinsons-Surf-Report/4053/'
# '/Long-Branch-Surf-Report/7946/'
# '/Long-Branch-Surf-Report/7947/'
# '/Manasquan-Surf-Report/386/'
# '/Monmouth-Beach-Surf-Report/4055/'
# '/Ocean-Grove-Surf-Report/7945/'
# '/Point-Pleasant-Surf-Report/7942/'
# '/Sea-Girt-Surf-Report/7943/'
# '/Spring-Lake-Surf-Report/7941/'
# '/The-Cove-Surf-Report/385/'
# '/Belmar-Surf-Report/3683/'
# '/Avon-Surf-Report/4050/'
# '/Deal-Surf-Report/822/'
# '/North-Street-Surf-Report/4946/'
# '/Margate-Pier-Surf-Report/4054/'
# '/Ocean-City-NJ-Surf-Report/391/'
# '/7th-St-Surf-Report/7918/'
# '/Brigantine-Surf-Report/4747/'
# '/Brigantine-Seawall-Surf-Report/4942/'
# '/Crystals-Surf-Report/4943/'
# '/Longport-32nd-St-Surf-Report/1158/'
# '/Margate-Pier-Surf-Report/4054/'
# '/North-Street-Surf-Report/4946/'
# '/Ocean-City-NJ-Surf-Report/391/'
# '/South-Carolina-Ave-Surf-Report/4944/'
# '/St-James-Surf-Report/7917/'
# '/States-Avenue-Surf-Report/390/'
# '/Ventnor-Pier-Surf-Report/4945/'
# '/14th-Street-Surf-Report/9055/'
# '/18th-St-Surf-Report/9056/'
# '/30th-St-Surf-Report/9057/'
# '/56th-St-Surf-Report/9059/'
# '/Diamond-Beach-Surf-Report/9061/'
# '/Strathmere-Surf-Report/7919/'
# '/The-Cove-Surf-Report/7921/'
# '/14th-Street-Surf-Report/9055/'
# '/18th-St-Surf-Report/9056/'
# '/30th-St-Surf-Report/9057/'
# '/56th-St-Surf-Report/9059/'
# '/Avalon-Surf-Report/821/'
# '/Diamond-Beach-Surf-Report/9061/'
# '/Nuns-Beach-Surf-Report/7948/'
# '/Poverty-Beach-Surf-Report/4056/'
# '/Sea-Isle-City-Surf-Report/1281/'
# '/Stockton-Surf-Report/393/'
# '/Stone-Harbor-Surf-Report/7920/'
# '/Strathmere-Surf-Report/7919/'
# '/The-Cove-Surf-Report/7921/'
# '/Wildwood-Surf-Report/392/'
Alternatively, I can use the surf IDs:
3683
386
7945
857
4050
4951
3683
9183
7944
9175
822
9174
4053
7946
7947
386
4055
7945
7942
7943
7941
385
3683
4050
822
4946
4054
391
7918
4747
4942
4943
1158
4054
4946
391
4944
7917
390
4945
9055
9056
9057
9059
9061
7919
7921
9055
9056
9057
9059
821
9061
7948
4056
1281
393
7920
7919
7921
392
Answers:
EDIT: Given you've confirmed your list of links (and that it stays static), you can check all of them daily like this:
import requests
import pandas as pd
from bs4 import BeautifulSoup
id_list = [
'/Belmar-Surf-Report/3683',
'/Manasquan-Surf-Report/386/',
'/Ocean-Grove-Surf-Report/7945/',
'/Asbury-Park-Surf-Report/857/',
'/Avon-Surf-Report/4050/',
'/Bay-Head-Surf-Report/4951/',
'/Belmar-Surf-Report/3683/',
'/Boardwalk-Surf-Report/9183/',
'/Bradley-Beach-Surf-Report/7944/',
'/Casino-Surf-Report/9175/',
'/Deal-Surf-Report/822/',
'/Dog-Park-Surf-Report/9174/',
'/Jenkinsons-Surf-Report/4053/',
'/Long-Branch-Surf-Report/7946/',
'/Long-Branch-Surf-Report/7947/',
'/Manasquan-Surf-Report/386/',
'/Monmouth-Beach-Surf-Report/4055/',
'/Ocean-Grove-Surf-Report/7945/',
'/Point-Pleasant-Surf-Report/7942/',
'/Sea-Girt-Surf-Report/7943/',
'/Spring-Lake-Surf-Report/7941/',
'/The-Cove-Surf-Report/385/',
'/Belmar-Surf-Report/3683/',
'/Avon-Surf-Report/4050/',
'/Deal-Surf-Report/822/',
'/North-Street-Surf-Report/4946/',
'/Margate-Pier-Surf-Report/4054/',
'/Ocean-City-NJ-Surf-Report/391/',
'/7th-St-Surf-Report/7918/',
'/Brigantine-Surf-Report/4747/',
'/Brigantine-Seawall-Surf-Report/4942/',
'/Crystals-Surf-Report/4943/',
'/Longport-32nd-St-Surf-Report/1158/',
'/Margate-Pier-Surf-Report/4054/',
'/North-Street-Surf-Report/4946/',
'/Ocean-City-NJ-Surf-Report/391/',
'/South-Carolina-Ave-Surf-Report/4944/',
'/St-James-Surf-Report/7917/',
'/States-Avenue-Surf-Report/390/',
'/Ventnor-Pier-Surf-Report/4945/',
'/14th-Street-Surf-Report/9055/',
'/18th-St-Surf-Report/9056/',
'/30th-St-Surf-Report/9057/',
'/56th-St-Surf-Report/9059/',
'/Diamond-Beach-Surf-Report/9061/',
'/Strathmere-Surf-Report/7919/',
'/The-Cove-Surf-Report/7921/',
'/14th-Street-Surf-Report/9055/',
'/18th-St-Surf-Report/9056/',
'/30th-St-Surf-Report/9057/',
'/56th-St-Surf-Report/9059/',
'/Avalon-Surf-Report/821/',
'/Diamond-Beach-Surf-Report/9061/',
'/Nuns-Beach-Surf-Report/7948/',
'/Poverty-Beach-Surf-Report/4056/',
'/Sea-Isle-City-Surf-Report/1281/',
'/Stockton-Surf-Report/393/',
'/Stone-Harbor-Surf-Report/7920/',
'/Strathmere-Surf-Report/7919/',
'/The-Cove-Surf-Report/7921/',
'/Wildwood-Surf-Report/392/'
]
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for x in id_list:
    url = 'https://magicseaweed.com' + x
    r = requests.get(url, headers=headers)
    try:
        soup = BeautifulSoup(r.text, 'html.parser')
        dfs = pd.read_html(str(soup))
        for df in dfs:
            print(df)
            if df.shape[0] > 50:
                df.to_csv(f"{x.replace('/', '_').replace('-', '_')}.csv")
            print('____________')
    except Exception as e:
        print(x, e)
This returns several DataFrames for each page, some with more rows and some with fewer, and saves the ones with more than 50 rows:
0 1 2
0 Low 12:24AM -0.05m
1 High 6:25AM 1.28m
2 Low 12:28PM -0.01m
3 High 6:49PM 1.66m
____________
0 1
0 First Light 5:36AM
1 Sunrise 6:05AM
2 Sunset 8:00PM
3 Last Light 8:30PM
____________
Unnamed: 0 Surf Swell Rating Primary Swell Primary Swell.1 Primary Swell.2 Secondary Swell Secondary Swell.1 Secondary Swell.2 Secondary Swell.3 ... Wind Wind.1 Weather Weather.1 Prob. Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21
0 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 ... Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08 Wednesday 10/08
1 12am 0.5-0.8m NaN 0.9m 6s NaN 0.5m 9s NaN NaN ... 11 11 kph NaN NaN 26°c NaN NaN NaN NaN NaN NaN
2 3am 0.3-0.5m NaN 0.5m 9s NaN 0.8m 6s NaN NaN ... 13 17 kph NaN NaN 24°c NaN NaN NaN NaN NaN NaN
3 6am 0.2-0.3m NaN 0.5m 9s NaN 0.7m 6s NaN NaN ... 12 16 kph NaN NaN 24°c NaN NaN NaN NaN NaN NaN
4 9am 0.3-0.6m NaN 0.5m 9s NaN 0.7m 6s NaN NaN ... 13 16 kph NaN NaN 25°c NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
121 High 11:57PM 1.34m NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
122 First Light 5:42AM NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
123 Sunrise 6:10AM NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124 Sunset 7:53PM NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
125 Last Light 8:21PM NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
126 rows × 22 columns
____________
0 1 2
0 Low 12:24AM -0.05m
1 High 6:25AM 1.28m
2 Low 12:28PM -0.01m
3 High 6:49PM 1.66m
____________
0 1
0 First Light 5:36AM
1 Sunrise 6:05AM
2 Sunset 8:00PM
3 Last Light 8:30PM
____________
0 1 2
0 Low 1:19AM -0.13m
1 High 7:21AM 1.37m
2 Low 1:26PM -0.06m
3 High 7:43PM 1.7m
____________
0 1
0 First Light 5:37AM
1 Sunrise 6:06AM
2 Sunset 7:59PM
3 Last Light 8:28PM
____________
0 1 2
0 Low 2:11AM -0.18m
1 High 8:14AM 1.43m
2 Low 2:21PM -0.09m
3 High 8:34PM 1.69m
____________
0 1
0 First Light 5:38AM
1 Sunrise 6:07AM
2 Sunset 7:58PM
3 Last Light 8:27PM
____________
0 1 2
0 Low 2:59AM -0.21m
1 High 9:05AM 1.47m
2 Low 3:13PM -0.09m
3 High 9:24PM 1.64m
____________
0 1
0 First Light 5:39AM
1 Sunrise 6:08AM
2 Sunset 7:57PM
3 Last Light 8:25PM
____________
0 1 2
0 Low 3:46AM -0.2m
1 High 9:57AM 1.47m
2 Low 4:03PM -0.06m
3 High 10:14PM 1.56m
____________
0 1
0 First Light 5:40AM
1 Sunrise 6:09AM
2 Sunset 7:55PM
3 Last Light 8:24PM
____________
0 1 2
0 Low 4:29AM -0.15m
1 High 10:48AM 1.46m
2 Low 4:52PM 0.01m
3 High 11:05PM 1.46m
____________
0 1
0 First Light 5:41AM
1 Sunrise 6:10AM
2 Sunset 7:54PM
3 Last Light 8:23PM
____________
0 1 2
0 Low 5:12AM -0.07m
1 High 11:39AM 1.43m
2 Low 5:42PM 0.1m
3 High 11:57PM 1.34m
____________
0 1
0 First Light 5:42AM
1 Sunrise 6:10AM
2 Sunset 7:53PM
3 Last Light 8:21PM
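One small cleanup worth considering: the confirmed list contains repeats (Belmar, Manasquan, Ocean Grove and others appear more than once, and the first Belmar entry lacks the trailing slash), so each duplicate gets fetched again. Normalizing and deduplicating before the request loop avoids that. A minimal sketch on a shortened list:

```python
id_list = [
    '/Belmar-Surf-Report/3683',      # note: missing the trailing slash
    '/Manasquan-Surf-Report/386/',
    '/Belmar-Surf-Report/3683/',     # repeat
    '/Manasquan-Surf-Report/386/',   # repeat
]

# normalize trailing slashes so variants of the same path compare equal
normalized = [x if x.endswith('/') else x + '/' for x in id_list]

# dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+)
deduped = list(dict.fromkeys(normalized))
print(deduped)  # ['/Belmar-Surf-Report/3683/', '/Manasquan-Surf-Report/386/']
```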