How to extract links from a website in Python?
Question:
I am trying to scrape the website below. As a first step, I would like to get the links from which to extract the text. However, when I do the following, I get an empty list:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.federalreserve.gov/newsevents/speeches.htm'
r = BeautifulSoup(requests.get(url).content, features="lxml")
r.select('.itemTitle')
Can anyone tell me what I am doing wrong?
Thanks
Answers:
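The list of speeches is not in the HTML that requests downloads; the page appears to load it afterwards from a JSON endpoint, which is why .itemTitle matches nothing. You could request the JSON from that endpoint directly and, based on your imports, convert it into a pandas DataFrame.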
import json
import requests
import pandas as pd
pd.DataFrame(
    json.loads(requests.get('https://www.federalreserve.gov/json/ne-speeches.json').content)
)
Output:
| | d | t | s | lo | l | a | o | v | video | updateDate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3/29/2023 8:30:00 AM | Brief Remarks | Vice Chair for Supervision Michael S. Barr | At the National Community Reinvestment Coalition Just Economy Conference, Washington, D.C. (via prerecorded video) | /newsevents/speech/barr20230329a.htm | | no | | No | nan |
| 1 | 3/27/2023 5:00:00 PM | Implementation and Transmission of Monetary Policy | Governor Philip N. Jefferson | At the H. Parker Willis Lecture, Washington and Lee University, Lexington, Virginia | /newsevents/speech/jefferson20230327a.htm | | no | | No | nan |
| 2 | 3/14/2023 5:20:00 PM | The Innovation Imperative: Modernizing Traditional Banking | Governor Michelle W. Bowman | At the Independent Community Bankers of America ICBA Live 2023 Conference, Honolulu, Hawaii | /newsevents/speech/bowman20230314a.htm | | no | | No | nan |
| 3 | 3/9/2023 10:00:00 AM | Supporting Innovation with Guardrails: The Federal Reserve’s Approach to Supervision and Regulation of Banks’ Crypto-related Activities | Vice Chair for Supervision Michael S. Barr | At the Peterson Institute for International Economics, Washington, D.C. | /newsevents/speech/barr20230309a.htm | | no | https://www.youtube.com/user/PetersonInstitute | No | nan |
| 4 | 3/3/2023 3:00:00 PM | Panel on “Design Issues for Central Bank Facilities in the Future” | Governor Michelle W. Bowman | At The Chicago Booth Initiative on Global Markets Workshop on Market Dysfunction, Chicago, Illinois | /newsevents/speech/bowman20230303a.htm | | no | | No | nan |
| … | | | | | | | | | | |
| 973 | 1/18/2017 3:00:00 PM | The Goals of Monetary Policy and How We Pursue Them | Chair Janet L. Yellen | At the Commonwealth Club, San Francisco, California | /newsevents/speech/yellen20170118a.htm | | no | | Yes | nan |
| 974 | 1/17/2017 10:00:00 AM | Monetary Policy in a Time of Uncertainty | Governor Lael Brainard | At the Brookings Institution, Washington, D.C. | /newsevents/speech/brainard20170117a.htm | | no | | Yes | nan |
| 975 | 1/12/2017 7:00:00 PM | Welcoming Remarks | Chair Janet L. Yellen | At the Conversation with the Chair: A Teacher Town Hall Meeting, Washington, D.C. | /newsevents/speech/yellen20170112a.htm | | no | | Yes | nan |
| 976 | 1/7/2017 11:15:00 AM | Low Interest Rates and the Financial System | Governor Jerome H. Powell | At the 77th Annual Meeting of the American Finance Association, Chicago, Illinois | /newsevents/speech/powell20170107a.htm | | no | | No | nan |
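The short column names are not explained in the answer. Judging only from the values in the output, `d` looks like the date, `t` the title, `s` the speaker and `l` the site-relative link; that reading is an assumption. Under it, a minimal standard-library sketch for turning one record into something more readable could look like this (the sample record is copied from the first output row; `BASE_URL` and `tidy` are names made up for the illustration):

```python
from datetime import datetime
from urllib.parse import urljoin

BASE_URL = "https://www.federalreserve.gov"  # site root, used to resolve relative links

# One record in the shape shown in the output (values from the first row).
record = {
    "d": "3/29/2023 8:30:00 AM",
    "t": "Brief Remarks",
    "s": "Vice Chair for Supervision Michael S. Barr",
    "l": "/newsevents/speech/barr20230329a.htm",
}


def tidy(rec):
    """Return a dict with readable keys, a parsed date and an absolute URL.

    The meaning given to the short keys is an assumption based on their values.
    """
    return {
        "date": datetime.strptime(rec["d"], "%m/%d/%Y %I:%M:%S %p"),
        "title": rec["t"],
        "speaker": rec["s"],
        "url": urljoin(BASE_URL, rec["l"]),
    }


speech = tidy(record)
print(speech["date"].isoformat(), speech["url"])
```

With a DataFrame, the same idea would probably be expressed with `DataFrame.rename` and `pd.to_datetime` instead of a helper function.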
No-pandas approach:
import json
import string
import requests
url = "https://www.federalreserve.gov/json/ne-speeches.json"
# Keep only the characters in string.printable (printable ASCII) before parsing the text as JSON.
speeches = json.loads(
    "".join(filter(lambda x: x in string.printable, requests.get(url).text))
)
for speech in speeches:
    try:
        # 'l' holds the site-relative link to the speech page.
        print(f"https://www.federalreserve.gov{speech['l']}")
    except KeyError:
        print("No link :(")
Output:
https://www.federalreserve.gov/newsevents/speech/barr20230329a.htm
https://www.federalreserve.gov/newsevents/speech/jefferson20230327a.htm
https://www.federalreserve.gov/newsevents/speech/bowman20230314a.htm
https://www.federalreserve.gov/newsevents/speech/barr20230309a.htm
https://www.federalreserve.gov/newsevents/speech/bowman20230303a.htm
https://www.federalreserve.gov/newsevents/speech/waller20230302a.htm
https://www.federalreserve.gov/newsevents/speech/jefferson20230227a.htm
https://www.federalreserve.gov/newsevents/speech/jefferson20230224a.htm
https://www.federalreserve.gov/newsevents/speech/cook20230216a.htm
https://www.federalreserve.gov/newsevents/speech/bowman20230215a.htm
https://www.federalreserve.gov/newsevents/speech/bowman20230213a.htm
https://www.federalreserve.gov/newsevents/speech/waller20230210a.htm
...
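One thing to be aware of: filtering on `string.printable` keeps only ASCII characters, so it also drops characters such as the curly quotes that appear in some of the titles. If the reason for the filter is a byte order mark at the start of the response (an assumption; the answer does not say why the filter is there), decoding the bytes as `utf-8-sig` removes the mark and leaves the rest of the text intact. A small offline sketch with a made-up payload:

```python
import codecs
import json

# Made-up payload imitating a UTF-8 response that starts with a byte order mark.
raw = codecs.BOM_UTF8 + json.dumps(
    [{"t": "The Federal Reserve’s Approach", "l": "/newsevents/speech/barr20230309a.htm"}],
    ensure_ascii=False,
).encode("utf-8")

# "utf-8-sig" strips a leading BOM if present and otherwise behaves like UTF-8.
speeches = json.loads(raw.decode("utf-8-sig"))

print(speeches[0]["t"])
```

With requests, the equivalent would be `json.loads(requests.get(url).content.decode("utf-8-sig"))`.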