BeautifulSoup find a href in marquee
Question:
I’m using bs4 to scrape links from a scrolling marquee. I’m able to get the marquee data, which is returned as a bs4 ResultSet. However, I cannot seem to access the hrefs within the data.
I’m sure I’m missing something as I’m new to web scraping, and appreciate any guidance anyone has.
Note: I can get the links easy peasy with selenium and chrome driver, but it takes forever.
This returns all of the marquee data:
import requests
from bs4 import BeautifulSoup

url = 'https://drugs.globalincidentmap.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
marquee = soup.select('div', class_='h-48')
print(marquee)
However, when I try to drill down further into the data, I get an empty list, NoneType/KeyError, or AttributeError.
for a in marquee.find_all('a', href=True):
    link = a.find('div', class_=':nth-child')
or
for a in marquee.find_all('a', href=True):
    link = a.find('div', class_='flex p-2')
Answers:
"I can get the links easy peasy with selenium and chrome driver"
That is probably because the div with the h-48 class is loaded with JavaScript, so it is not in the HTML that requests fetches. Separately, soup.select('div', class_='h-48') does not give the results you expect: select isn’t really supposed to have a class_ argument, just a CSS selector string, and it returns a ResultSet (a list of tags), which has no find_all method of its own. That explains the AttributeError.
soup.find('div', attrs={'class':'h-48'}) or soup.select('div.h-48') can be expected to work on the HTML that is formed after JS loading, even though that element has more classes than h-48, but you need selenium to get that…
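To see how these lookups behave once the markup is present, here is a minimal sketch on a made-up snippet. The HTML below is hypothetical and is not the site's real markup. It suggests that matching on h-48 works when the element carries other classes too, and that the list returned by select has to be iterated before calling find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the JS-rendered page (not the real site).
html = '''
<div class="h-48 overflow-hidden">
  <a href="/event_detail?id=1"><div class="flex p-2">first</div></a>
  <a href="/event_detail?id=2"><div class="flex p-2">second</div></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Matches because h-48 is one of the element's classes.
box = soup.find('div', attrs={'class': 'h-48'})
print(box['class'])

# select() takes a CSS selector string and returns a list-like ResultSet,
# so call find_all on each element in it, not on the ResultSet itself.
boxes = soup.select('div.h-48')
hrefs = [a['href'] for b in boxes for a in b.find_all('a', href=True)]
print(hrefs)
```

If only one element is expected, select_one('div.h-48') returns a single tag (or None), which avoids the loop.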
Fortunately, I think the data you want is already in the fetched HTML, just in a different format: as JSON in the marquee tag’s :contents attribute. You can extract a list of dictionaries (mqCont) with
import json

marq = soup.find('marquee', attrs={'class': 'h-48'})
mqCont = []
if marq is None:
    print('Could Not Find marquee.h-48')
elif not marq.get(':contents'):
    print('marquee.h-48 has no [:contents] attr')
else:
    # the attribute holds a JSON array of event records
    try:
        mqCont = json.loads(marq.get(':contents'))
    except Exception as e:
        print('failed to parse marquee.h-48[:contents] <---', e)
or, more concisely (if you’re confident there won’t be any error to debug):
mqCont = json.loads(soup.select_one('marquee.h-48').get(':contents', '[]'))
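As a way to try the extraction without hitting the site, the sketch below runs the same idea on a hypothetical HTML string. The marquee markup, the record values, and the helper name marquee_records are all made up for illustration, and the real attribute may hold more fields.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: the records sit in a
# ":contents" attribute as a JSON array. All values are invented.
html = '''<marquee class="h-48" :contents='[{"id": 1, "url": "https://example.com/a"}, {"id": 2, "url": "https://example.com/b"}]'></marquee>'''


def marquee_records(page_html):
    """Return the list stored in marquee.h-48[:contents], or [] if missing or invalid."""
    soup = BeautifulSoup(page_html, 'html.parser')
    marq = soup.select_one('marquee.h-48')
    if marq is None:
        return []
    try:
        return json.loads(marq.get(':contents', '[]'))
    except json.JSONDecodeError:
        return []


records = marquee_records(html)
article_links = [m['url'] for m in records if 'url' in m]
print(article_links)
```

Returning an empty list on failure keeps the later list comprehensions working, at the cost of hiding why nothing was found. Print or log inside the helper if you need to debug.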
You can get a list of links to news articles with [m['url'] for m in mqCont if 'url' in m], but since you were trying to use find with class_='flex p-2', you probably want the .../event_detail?id=... links. You can form them as below:
evtUrls = [f"{url.strip('/')}/event_detail?id={m['id']}" for m in mqCont if 'id' in m]
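An alternative sketch for building the same links uses the standard library's urljoin and urlencode, which handle the trailing slash and escape the query value. The two sample records below are trimmed-down assumptions (only the id values come from the table in this answer). Real entries carry many more keys.

```python
from urllib.parse import urlencode, urljoin

url = 'https://drugs.globalincidentmap.com/'

# Trimmed, illustrative records; the last one has no id and is skipped.
mqCont = [
    {'id': 11919404, 'url': 'https://example.com/article-1'},
    {'id': 11919401, 'url': 'https://example.com/article-2'},
    {'url': 'https://example.com/no-id'},
]

# urljoin resolves the relative path against the base URL.
evtUrls = [
    urljoin(url, 'event_detail?' + urlencode({'id': m['id']}))
    for m in mqCont if 'id' in m
]
print(evtUrls)
```

For this base URL the result is the same as with the f-string version; the difference only matters if the base has a path or the id needs escaping.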
You can also view the list of dictionaries as a table [with pandas] by doing something like:
import pandas

omitKeys = ['domain_event_types', 'country']
for i, m in enumerate(mqCont):
    # strip any HTML from the description and collapse runs of whitespace
    mDesc = ' '.join(BeautifulSoup(
        m.get('description') or '', 'html.parser'
    ).get_text().split())
    if mDesc: m['description'] = mDesc
    if 'id' in m: m['eventUrl'] = f"{url.strip('/')}/event_detail?id={m['id']}"
    # drop the omitted keys from each record
    mqCont[i] = {k: v for k, v in m.items() if k not in omitKeys}
mqcDF = pandas.DataFrame(mqCont).dropna(axis='columns', how='all').set_index('id')
The first 5 rows [of 100 rows total] of mqcDF:
| id | country_id | address | event_gmt_time | severity | infrastructure | tip_text | url | description | latitude | longitude | created_user_id | location_granularity_id | is_approved | created_at | updated_at | eventUrl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11919404 | 231 | Pennsylvania, USA | 2022-12-01 18:36:53 | Severe | Unknown | PENNSYLVANIA – Photos – Suspects – Evidence In Multi-County Drug Bust | https://www.wfmz.com/news/area/berks/photos-suspects-evidence-in-multi-county-drug-bust/collection_bf795c98-71ad-11ed-99fe-4305f426699b.html#1 | [69 NEWS] PENNSYLVANIA – PHOTOS: Suspects, evidence in multi-county drug bust "Authorities said they seized evidence that included 27.5 kilograms of cocaine with a potential street value of $2.7 million and 5.5 kilograms of fentanyl with a potential street value of $1.6 million." Read full article at: https://www.wfmz.com/news/area/berks/photos-suspects-evidence-in-multi-county-drug-bust/collection_bf795c98-71ad-11ed-99fe-4305f426699b.html#1 | 41.2033 | -77.1945 | 14 | 8 | 1 | 2022-12-02T18:44:43.000000Z | 2022-12-02T18:44:43.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919404 |
| 11919401 | 40 | Vancouver Island, British Columbia, Canada | 2022-12-01 18:33:01 | Severe | Unknown | CANADA – Drugs – Guns Seized As 4 BC Men With Hells Angels Ties Face Serious Charges | https://www.terracestandard.com/news/alleged-drug-traffickers-on-vancouver-island-with-hells-angels-ties-face-serious-charges/ | [terracestandard.com] CANADA – Drugs, guns seized as 4 B.C. men with Hells Angels ties face ‘serious charges’ "CFSEU said the seized drugs included 7.75kg of cocaine, 4kg of cannabis, 1.9kg of methamphetamine, 248 oxycodone pills, and more." Read full article at: https://www.terracestandard.com/news/alleged-drug-traffickers-on-vancouver-island-with-hells-angels-ties-face-serious-charges/ | 49.6506 | -125.449 | 14 | 5 | 1 | 2022-12-02T18:36:37.000000Z | 2022-12-02T18:36:37.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919401 |
| 11919397 | 133 | Male, Maldives | 2022-11-20 18:29:26 | Severe | Unknown | MALDIVES – Drugs Worth Mvr 2 Mln Seized By Customs | https://avas.mv/en/125385 | [avas.mv] MALDIVES – Drugs worth MVR 2 mln seized by Customs "Maldives Customs Service has seized 1.34 kg of drugs smuggled into the Maldives via courier." Read full article at: https://avas.mv/en/125385 | 4.1755 | 73.5093 | 14 | 5 | 1 | 2022-12-02T18:32:45.000000Z | 2022-12-02T18:32:45.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919397 |
| 11919394 | 231 | 100 South Willow Avenue, Compton, CA, USA | 2022-11-29 18:23:50 | Severe | Unknown | CALIFORNIA – USD4 Million Worth Of Illegal Drugs Seized In Compton | https://www.foxla.com/news/4-million-worth-of-illegal-drugs-seized-in-compton | [foxla] CALIFORNIA – $4 million worth of illegal drugs seized in Compton "A search warrant at the home resulted in the seizure of about 5.5 lbs. of suspected tar heroin, 10 kilos of suspected powder cocaine, 6 kilos of suspected powder fentanyl, 6,000 suspected ecstasy pills containing fentanyl, and 254,000 suspected fentanyl pills all worth a combined estimated street value of $4.17 million, authorities said. " Read full article at: https://www.foxla.com/news/4-million-worth-of-illegal-drugs-seized-in-compton | 33.896 | -118.218 | 14 | 5 | 1 | 2022-12-02T18:29:25.000000Z | 2022-12-02T18:29:25.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919394 |
| 11919392 | 166 | Gwadar, Pakistan | 2022-12-01 18:22:00 | Severe | Unknown | PAKISTAN – Convoy Of Camels Loaded With Drugs Seized | https://pakobserver.net/convoy-of-camels-loaded-with-drugs-seized/ | [pakobserver.net] PAKISTAN – Convoy Of Camels Loaded With Drugs Seized "While searching the goods carried by the camels, ANF officials found them to be full of drugs (hashish). The drugs weighed around 1.4 tons." Read full article at: https://pakobserver.net/convoy-of-camels-loaded-with-drugs-seized/ | 25.1313 | 62.325 | 14 | 5 | 1 | 2022-12-02T18:23:49.000000Z | 2022-12-02T18:23:49.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919392 |
Markdown for the above table was printed with print(mqcDF.loc[mqcDF.index[:5]].to_markdown())