How to remove html comments using Beautiful Soup
Question:
I’m cleaning text from a crawled website, and I don’t want any HTML comments in my data. Do I have to parse them out myself, or is there an existing function for this?
I’ve tried doing this:
from bs4 import BeautifulSoup as S
soup = S("<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>")
soup.comment # == None
soup.style # == <style>html{color: #0000ff}</style>
Answers:
To search for HTML comments, you can use the bs4.Comment type:
from bs4 import BeautifulSoup, Comment
html_doc = '''
<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# find the first comment node (string= is the current name of the older text= argument)
comment = soup.find(string=lambda t: isinstance(t, Comment))
print(comment)
Prints:
t
To extract it:
comment = soup.find(string=lambda t: isinstance(t, Comment))
# extract() removes the comment node from the tree
comment.extract()
print(soup.prettify())
Prints:
<h1>
 Hejsa
</h1>
<style>
 html{color: #0000ff}
</style>
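The snippet above removes only the first comment, because find returns a single node. A minimal sketch for removing every comment, assuming the same html.parser setup, is to loop over find_all instead. The second comment in the sample markup is added here purely for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup with two comments (the second one is illustrative)
html_doc = "<!-- t --> <h1>Hejsa</h1> <!-- another --> <style>html{color: #0000ff}</style>"
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching node, so each comment is extracted in turn
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

The h1 and style elements are left untouched; only the Comment nodes are dropped from the tree.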
Use regex.
import re
html = "<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>"
html = re.sub(r'<!--[\s\S]*?-->', '', html).strip()
print(html)
Result:
<h1>Hejsa</h1> <style>html{color: #0000ff}</style>
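When a document holds more than one comment, the quantifier in the pattern matters. The sketch below compares a greedy and a non-greedy pattern on an illustrative string with two comments; the sample markup and variable names are assumptions made for the demonstration.

```python
import re

# Illustrative input with two comments
html = "<!-- a --> <h1>Hejsa</h1> <!-- b --> <p>text</p>"

# re.DOTALL lets "." match newlines, so multi-line comments are covered too
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Greedy: matches from the first "<!--" to the last "-->", swallowing the <h1>
greedy = re.sub(r"<!--.*-->", "", html, flags=re.DOTALL).strip()

# Non-greedy: each comment is matched and removed separately
lazy = COMMENT_RE.sub("", html).strip()

print(repr(greedy))
print(repr(lazy))
```

A regex works on the raw string and knows nothing about HTML structure, so it would also strip comment-like text inside script or style content; the parser-based approach avoids that.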
I need to do some quite basic scraping, but I am unable to extract the text "XYZ". Since the whole website I am looking at includes numerous cases of "<!-- -->", I’d like to remove those tags from the HTML code while keeping all the text around them, so that I can find its content in later stages.
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!-- -->XYZ<!-- --></span><span class="visually-hidden"><!-- -->XYZ<!-- --></span>
</span>
<!-- --><!-- --><!-- --> </div>
Initially, I tried the simple find / find_all methods. But in Chrome’s inspect mode, the section of the website that I’m interested in begins with the following:
<section id="ember11" class="......" tabindex="-1">
and unfortunately, searching the code with find or find_all doesn’t work.
Could anyone please help?:)
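One possible approach, sketched here on the snippet from the question and not tested against the real site, is to extract every Comment node and then read the text of the span. Selecting the span by its aria-hidden attribute is an assumption made for this example.

```python
from bs4 import BeautifulSoup, Comment

html_doc = '''
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!-- -->XYZ<!-- --></span><span class="visually-hidden"><!-- -->XYZ<!-- --></span>
</span>
<!-- --><!-- --><!-- --> </div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# Drop every empty "<!-- -->" comment; the surrounding text nodes stay in place
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Assumption: the visible copy of the text is the span marked aria-hidden="true"
span = soup.find("span", attrs={"aria-hidden": "true"})
text = span.get_text(strip=True)
print(text)
```

If find still returns nothing for an element that is visible in Chrome’s inspector, one possible cause is that the element is rendered by JavaScript and is missing from the downloaded HTML. The question does not confirm this, so treat it as a guess to check by printing the fetched markup.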