Remove all text from an HTML node using regex
Question:
Is it possible to remove all text from HTML nodes with a regex? This very simple case seems to work just fine:
import re
import htmlmin
html = """
<li class="menu-item">
<p class="menu-item__heading">Totopos</p>
<p>Chips and molcajete salsa</p>
<p class="menu-item__details menu-item__details--price">
<strong>
<span class="menu-item__currency"> $ </span>
4
</strong>
</p>
</li>
"""
print(re.sub(">(.*?)<", ">1<", htmlmin.minify(html)))
I tried to use BeautifulSoup but cannot figure out how to make it work. The following attempt is not quite correct, since it leaves "4" in as text.
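from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for n in soup.find_all(recursive=True):
    print(n.name, n.string)
    # n.string is None when a tag has more than one child, so such tags are skipped
    if n.string:
        n.string = ""
print(htmlmin.minify(str(soup)))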
Answers:
Try passing text=True (spelled string=True in Beautiful Soup 4.4.0 and later, where text is the older name) when you call find_all, and call extract() on each element to remove it:
from bs4 import BeautifulSoup
html = '''
<li class="menu-item">
<p class="menu-item__heading">Totopos</p>
<p>Chips and molcajete salsa</p>
<p class="menu-item__details menu-item__details--price">
<strong>
<span class="menu-item__currency"> $ </span>
4
</strong>
</p>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
for element in soup.find_all(text=True):
    element.extract()
print(soup.prettify())
In this case the output will be:
<li class="menu-item">
 <p class="menu-item__heading">
 </p>
 <p>
 </p>
 <p class="menu-item__details menu-item__details--price">
  <strong>
   <span class="menu-item__currency">
   </span>
  </strong>
 </p>
</li>
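If pulling in BeautifulSoup is not an option, the same idea (parse the document, keep the tags, drop every text node) can be sketched with the standard library's html.parser. This is a minimal sketch, not a drop-in replacement for the answer above: the names TagOnlyParser and strip_text are made up for illustration, and it assumes you only care about elements, so comments and doctype declarations are dropped along with the text.

```python
from html.parser import HTMLParser


class TagOnlyParser(HTMLParser):
    """Hypothetical helper: re-emit tags and silently skip all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw source of the latest start tag,
        # so attributes are preserved exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is deliberately not overridden: the default does nothing,
    # which is what discards the text (including whitespace between tags).


def strip_text(html):
    """Return html with every text node removed (assumption: tags only)."""
    parser = TagOnlyParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = """
<li class="menu-item">
<p class="menu-item__heading">Totopos</p>
<p>Chips and molcajete salsa</p>
<p class="menu-item__details menu-item__details--price">
<strong>
<span class="menu-item__currency"> $ </span>
4
</strong>
</p>
</li>
"""

print(strip_text(html))
```

Because whitespace between tags is also text, the result comes out on a single line, much like the minified output in the question. If comments or the doctype matter, handle_comment and handle_decl can be overridden in the same way.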
Attempting to manipulate HTML using regular expressions is almost never the best idea, but this regex should do the trick for you:
print(re.sub(r">[^<]+<", "><", htmlmin.minify(html)))
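To make the caveat concrete, here is a small sketch that runs the same pattern on three inputs. The helper name strip_text_regex and the sample strings are made up for illustration. The pattern treats anything between a ">" and the next "<" as text, so it works on simple markup but is confused by a ">" inside an attribute value and misses text that is not enclosed between two tags.

```python
import re

# Same pattern as above: a ">" followed by one or more non-"<" characters and a "<".
TEXT_BETWEEN_TAGS = re.compile(r">[^<]+<")


def strip_text_regex(html):
    """Hypothetical wrapper around the substitution shown above."""
    return TEXT_BETWEEN_TAGS.sub("><", html)


# Simple markup: works as intended.
print(strip_text_regex("<p>Totopos</p><p>Chips</p>"))
# <p></p><p></p>

# A ">" inside an attribute value starts a match too early and eats the rest of the tag.
print(strip_text_regex('<a title="a > b" href="#">link</a>'))
# <a title="a ></a>

# Leading text has no ">" in front of it, so it is left in place.
print(strip_text_regex("Hello <b>world</b>"))
# Hello <b></b>
```

For input you control and have already minified, the regex may be good enough; for arbitrary HTML, a parser-based approach is the safer choice.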