How to remove HTML comments using Beautiful Soup

Question:

I’m cleaning text from a crawled website, and I don’t want any HTML comments in my data. Do I have to parse them out myself, or is there an existing function for this?

I’ve tried doing this:

from bs4 import BeautifulSoup as S
soup = S("<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>")
soup.comment # == None
soup.style   # == <style>html{color: #0000ff}</style>
Asked By: Marius Johan

Answers:

To search for HTML comments, you can use the bs4.Comment type:

from bs4 import BeautifulSoup, Comment

html_doc = '''
    <!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# find the first comment node and print it:
comment = soup.find(text=lambda t: isinstance(t, Comment))
print(comment)

Prints:

t

To extract it:

comment = soup.find(text=lambda t: isinstance(t, Comment))

# extract comment:
comment.extract()
print(soup.prettify())

Prints:

<h1>
 Hejsa
</h1>
<style>
 html{color: #0000ff}
</style>
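
The snippet above removes only the first comment that find returns. To strip every comment, the same isinstance test can be passed to find_all instead. The following is a minimal sketch, not a definitive implementation: the helper name strip_comments is hypothetical, and it assumes Beautiful Soup 4.4.0 or later, where the string= argument is the current name for text=.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list of matches, so removing nodes while looping is safe
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


cleaned = strip_comments("<!-- a --><h1>Hejsa</h1><!-- b --><p>x<!-- c --></p>")
print(cleaned)
```

Comments nested inside other tags are removed as well, because find_all searches the whole tree.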
Answered By: Andrej Kesely

Use regex.

import re
html = "<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>"
html = re.sub(r'<!--[\s\S]*?-->', '', html).strip()
print(html)

Result:

<h1>Hejsa</h1> <style>html{color: #0000ff}</style>
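
When a document contains more than one comment, the quantifier matters: a greedy match runs from the first "<!--" to the last "-->" and deletes everything in between. Here is a small sketch of the difference, using a made-up two-comment string. The name COMMENT_RE is illustrative only. Note that a regex cannot tell a real comment from comment-like text inside a script or a string literal, so treat this as a quick cleanup, not a parser.

```python
import re

# Non-greedy: stop at the first "-->" after each "<!--".
# re.DOTALL lets "." match newlines, so multi-line comments are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

html = "<!-- a --><h1>Hejsa</h1><!-- b\nmultiline --><p>text</p>"

non_greedy = COMMENT_RE.sub("", html)
print(non_greedy)  # <h1>Hejsa</h1><p>text</p>

# Greedy version for comparison: it also swallows the <h1> between the comments.
greedy = re.sub(r"<!--[\s\S]*-->", "", html)
print(greedy)      # <p>text</p>
```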
Answered By: the_train

I need to do some quite basic scraping, but I am unable to extract the text "XYZ". The whole website I am looking at includes numerous cases of "<!-- -->", so I’d like to remove those markers from the HTML code while keeping all the text that sits between them, so that I can find that content in the later stages.

<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!-- -->XYZ<!-- --></span><span class="visually-hidden"><!-- -->XYZ<!-- --></span>
</span>
<!-- --><!-- --><!-- --> </div>

Initially, I tried the simplest find / find_all approach. But in Chrome’s inspect mode, the section of the website that I’m interested in begins with the following:

<section id="ember11" class="......" tabindex="-1">

and unfortunately searching the code with find or find_all doesn’t work.

Could anyone please help?:)
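
For the markup quoted above, one possible approach is to drop the comment nodes first and then read the text of the span. This is only a sketch under two assumptions: the HTML is already available as a string, and selecting on aria-hidden="true" is an illustrative choice. It does not address whether the downloaded page source actually contains the section seen in the inspector.

```python
from bs4 import BeautifulSoup, Comment

html = '''<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!-- -->XYZ<!-- --></span><span class="visually-hidden"><!-- -->XYZ<!-- --></span>
</span>
<!-- --><!-- --><!-- --> </div>'''

soup = BeautifulSoup(html, "html.parser")

# Remove every comment node; the text nodes next to them are left in place.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

span = soup.find("span", attrs={"aria-hidden": "true"})
text = span.get_text(strip=True)
print(text)         # XYZ
print(span.string)  # XYZ
```

Before the comments are removed, span.string is None because the tag has several children. Once "XYZ" is the only child left, .string returns it directly.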

Answered By: Kar01