Python/Pandas: How to convert a bs4.element.ResultSet into a Pandas DataFrame?

Question

I want to extract the title and the link of each item in a bs4.element.ResultSet into a pandas DataFrame:

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
from newspaper import Config  # assumption: Config comes from the newspaper3k package

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
user_input = "Solarpanels"
site = f'https://news.google.com/rss/search?q={user_input}+when:14d&hl=en-GB&gl=DE&ceid=GB:en'
op = urlopen(site)
rd = op.read()
sp_page = soup(rd, 'xml')
news_list = sp_page.find_all('item')

print(type(news_list))
print(news_list)

Output:

<class 'bs4.element.ResultSet'>
[<item><title>Australian research finds cost-effective way to recycle solar panels - The Guardian</title><link>https://www.theguardian.com/environment/2022/oct/16/australian-research-finds-cost-effective-way-to-recycle-solar-panels</link><guid isPermaLink="false">1605236140</guid><pubDate>Sat, 15 Oct 2022 23:51:00 GMT</pubDate><description>&lt;ol&gt;&lt;li&gt;&lt;a href="https://www.theguardian.com/environment/2022/oct/16/australian-research-finds-cost-effective-way-to-recycle-solar-panels" target="_blank"&gt;Australian research finds cost-effective way to recycle solar panels&lt;/a&gt;&amp;nbsp;&amp;nbsp;&lt;font color="#6f6f6f"&gt;The Guardian&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.techjuice.pk/australian-researchers-find-cost-effective-way-to-recycle-solar-panels/" target="_blank"&gt;Australian Researchers Find Cost-Effective Way To Recycle Solar Panels&lt;/a&gt;&amp;nbsp;&amp;nbsp;&lt;font color="#6f6f6f"&gt;TechJuice&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.esi-africa.com/industry-sectors/business-and-markets/how-could-recycling-solar-panels-be-scaled-up-for-sustainable-effect/" target="_blank"&gt;How could recycling solar panels be scaled up for sustainable effect&lt;/a&gt;&amp;nbsp;&amp;nbsp;&lt;font color="#6f6f6f"&gt;ESI Africa&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.digitaljournal.com/pr/solar-panel-recycling-market-to-rise-at-37-cagr-during-forecast-period-tmr-study" target="_blank"&gt;Solar Panel Recycling Market to Rise at 37% CAGR during Forecast Period: TMR Study&lt;/a&gt;&amp;nbsp;&amp;nbsp;&lt;font color="#6f6f6f"&gt;Digital Journal&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;&lt;a href="https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2lzNjdmOUJSR3NNT0h4Y0h5dF9TZ0FQAQ?oc=5" target="_blank"&gt;View Full coverage on Google News&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;&lt;/ol&gt;</description><source url="https://www.theguardian.com">The Guardian</source></item> 

... and much more

I have tried several approaches, but I can't get it to work.

Asked By: langermc

Answer 1

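Loop over the <item> elements, collect the fields you need into a list of dictionaries, and pass that list to pd.DataFrame: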

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

user_input = "Solarpanels"
site = f"https://news.google.com/rss/search?q={user_input}+when:14d&hl=en-GB&gl=DE&ceid=GB:en"


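# Fetch the RSS feed and parse it as XML (the "xml" parser requires lxml).
soup = BeautifulSoup(requests.get(site, headers=headers).content, "xml")

# Build one dictionary per <item>; each dictionary becomes one DataFrame row.
all_data = []
for item in soup.select("item"):
    all_data.append(
        {
            "title": item.title.text,
            "link": item.link.text,
            "pubDate": item.pubDate.text,
            # The description is escaped HTML, so parse it again to get plain text.
            "description": BeautifulSoup(
                item.description.text, "html.parser"
            ).get_text(strip=True), # or .get_text(strip=True, separator=" ")
            "source": item.source.text,
            "source_url": item.source["url"],
        }
    )

df = pd.DataFrame(all_data)
# to_markdown() requires the optional "tabulate" package.
print(df.head().to_markdown(index=False))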

Prints:

| title | link | pubDate | description | source | source_url |
|:---|:---|:---|:---|:---|:---|
| Australian research finds cost-effective way to recycle solar panels - The Guardian | https://www.theguardian.com/environment/2022/oct/16/australian-research-finds-cost-effective-way-to-recycle-solar-panels | Sat, 15 Oct 2022 23:51:00 GMT | Australian research finds cost-effective way to recycle solar panelsThe GuardianAustralian Researchers Find Cost-Effective Way To Recycle Solar PanelsTechJuiceHow could recycling solar panels be scaled up for sustainable effectESI AfricaSolar Panel Recycling Market to Rise at 37% CAGR during Forecast Period: TMR StudyDigital JournalView Full coverage on Google News | The Guardian | https://www.theguardian.com |
| Business Matters: Solar Panels on Commercial Property: Why You Should Make the Switch - Insider Media | https://www.insidermedia.com/blogs/north-west/business-matters-solar-panels-on-commercial-property-why-you-should-make-the-switch | Mon, 17 Oct 2022 09:13:35 GMT | Business Matters: Solar Panels on Commercial Property: Why You Should Make the SwitchInsider Media | Insider Media | https://www.insidermedia.com |
| Cost of living: The people using solar panels and turbines to reduce bills - bbc.co.uk | https://www.bbc.co.uk/news/uk-england-essex-62967716 | Wed, 05 Oct 2022 07:00:00 GMT | Cost of living: The people using solar panels and turbines to reduce billsbbc.co.uk | bbc.co.uk | https://www.bbc.co.uk |
| School applies for 120 solar panels - Stamford Mercury | https://www.stamfordmercury.co.uk/news/school-applies-for-120-solar-panels-9278921/ | Mon, 17 Oct 2022 11:00:00 GMT | School applies for 120 solar panelsStamford Mercury | Stamford Mercury | https://www.stamfordmercury.co.uk |
| Solar panels enable Lanarkshire village hall to cut running costs by 80 per cent - Daily Record | https://www.dailyrecord.co.uk/in-your-area/lanarkshire/solar-panels-enable-lanarkshire-village-28211459 | Sun, 16 Oct 2022 18:50:00 GMT | Solar panels enable Lanarkshire village hall to cut running costs by 80 per centDaily Record | Daily Record | https://www.dailyrecord.co.uk |
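
Note that the pubDate values above are plain strings, so sorting or filtering by date will not behave as expected until they are converted. A minimal sketch of one way to do that, assuming the feed keeps using RFC 2822 style dates like the ones shown; the two-row DataFrame here is a hand-made stand-in for the real result, and the standard-library parser is a choice made for this sketch, not part of the original answer:

```python
from email.utils import parsedate_to_datetime

import pandas as pd

# Stand-in for the scraped DataFrame: two pubDate strings copied from the output above.
df = pd.DataFrame(
    {
        "source": ["bbc.co.uk", "The Guardian"],
        "pubDate": [
            "Wed, 05 Oct 2022 07:00:00 GMT",
            "Sat, 15 Oct 2022 23:51:00 GMT",
        ],
    }
)

# parsedate_to_datetime understands RFC 2822 dates, including the "GMT" suffix.
# pd.to_datetime(..., utc=True) then normalises the column to a UTC datetime dtype.
df["pubDate"] = pd.to_datetime(df["pubDate"].map(parsedate_to_datetime), utc=True)

# With a real datetime column, newest-first ordering is a plain sort.
df = df.sort_values("pubDate", ascending=False).reset_index(drop=True)
print(df)
```

After the conversion, the .dt accessor (for example df["pubDate"].dt.date) and date comparisons work on the column.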

Answered By: Andrej Kesely
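
If only the title and the link are needed, as the question asks, the ResultSet returned by find_all('item') can be turned into a DataFrame directly. The following is a minimal sketch rather than a definitive implementation: items_to_frame is a hypothetical helper name, and SAMPLE_RSS is a trimmed, hand-written stand-in for the feed so that the example runs without network access. It assumes lxml is installed, which the "xml" parser requires.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed stand-in for the Google News feed from the question (assumption:
# the real feed has the same <item>/<title>/<link> layout, as the output shows).
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Australian research finds cost-effective way to recycle solar panels - The Guardian</title><link>https://www.theguardian.com/environment/2022/oct/16/australian-research-finds-cost-effective-way-to-recycle-solar-panels</link></item>
<item><title>School applies for 120 solar panels - Stamford Mercury</title><link>https://www.stamfordmercury.co.uk/news/school-applies-for-120-solar-panels-9278921/</link></item>
</channel></rss>"""


def items_to_frame(items):
    """Build a two-column DataFrame from an iterable of <item> tags (hypothetical helper)."""
    rows = []
    for item in items:
        title = item.find("title")
        link = item.find("link")
        rows.append(
            {
                # Fall back to None when a tag is missing instead of raising.
                "title": title.get_text(strip=True) if title is not None else None,
                "link": link.get_text(strip=True) if link is not None else None,
            }
        )
    return pd.DataFrame(rows, columns=["title", "link"])


# Same call as in the question: find_all returns a bs4.element.ResultSet.
news_list = BeautifulSoup(SAMPLE_RSS, "xml").find_all("item")
df = items_to_frame(news_list)
print(df)
```

With the live feed, passing the question's own news_list to items_to_frame should work the same way, since the helper only relies on each item having title and link children.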
