When exporting to a csv file I get tags but in my terminal when I run the code I don't see them

Question:

I’m building a web scraper for testing/education purposes and I’m running into the following issue:

When I run the code the terminal displays the text, dates and links as expected, but when I export the info to a csv file the last column where I’m supposed to see the adte and time of the article when it was published I get the html

and tags.

I believe it’s something either with the formatting or I’m not hitting the right method to remove the p and span tags (this is tricky I’ve tried a few methods and suggestions already discussed here). It works with ‘get_text()’, but doesn’t export it correctly to csv.

My code:

import requests 
from bs4 import BeautifulSoup
import sys
import pandas as pd



url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
    article_link = article.a.get('href')
    article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
    article_date = article.find('p', class_ = 'g-grid__item--time')
    
    print(article_title, end='n')
    print(article_link, end='n'*2)
    if article_date == None:
        print('')
    else:
        print(article_date.text)
    

    art.append({
        'Headline': article_title,
        'Link': article_link,
        'Date Published': article_date
    })

df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)

Screenshot below:
p and span tags appear after export to csv

Any suggestions would be appreciated.
Thanks!

Asked By: gginchev

||

Answers:

Because you aren’t invoking .text property in date column

import requests 
from bs4 import BeautifulSoup
import sys
import pandas as pd

url = 'https://novini.bg'
request = requests.get('https://novini.bg')
soup = BeautifulSoup(request.text, 'html.parser')
articles = soup.find_all('article', class_= 'g-grid__item js-content')
art = []
for article in articles:
    article_link = article.a.get('href')
    article_title = article.find('h2', {'class', 'g-grid__item-title'}).text
    article_date = article.find('p', class_ = 'g-grid__item--time')
    article_date = article_date.text if article_date else None
    
    #print(article_title, end='n')
    #print(article_link, end='n'*2)
    # if article_date == None:
    #     print('')
    # else:
    #     print(article_date.text)
    

    art.append({
        'Headline': article_title,
        'Link': article_link,
        'Date Published': article_date
    })

#print(art)

df = pd.DataFrame(art)
df.to_csv('News_Bulgaria.csv', index=False)
Answered By: F.Hoque
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.