Remove all text from an HTML node using regex

Question

Is it possible to remove all text from HTML nodes with a regex? This very simple case seems to work just fine:

import re

import htmlmin

html = """
<li class="menu-item">
  <p class="menu-item__heading">Totopos</p>
  <p>Chips and molcajete salsa</p>
  <p class="menu-item__details menu-item__details--price">
    <strong>
      <span class="menu-item__currency"> $ </span>
      4
    </strong>
  </p>
</li>
"""

print(re.sub(">(.*?)<", ">1<", htmlmin.minify(html)))

I tried BeautifulSoup but cannot figure out how to make it work. The following attempt is not quite correct, since it leaves "4" in as text.

from bs4 import BeautifulSoup
from htmlmin import minify

soup = BeautifulSoup(html, "html.parser")
for n in soup.find_all(recursive=True):
    print(n.name, n.string)
    if n.string:
        n.string = ""
print(minify(str(soup)))

Asked By: chhenning

Answer 1

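The loop in the question misses "4" because .string is None for any tag that has more than one child, and <strong> contains both the <span> and the text "4". Instead, pass text=True to find_all (spelled string=True in newer Beautiful Soup releases) so that it returns the text nodes themselves, then call extract() on each one to remove it: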

from bs4 import BeautifulSoup

html = '''
<li class="menu-item">
  <p class="menu-item__heading">Totopos</p>
  <p>Chips and molcajete salsa</p>
  <p class="menu-item__details menu-item__details--price">
    <strong>
      <span class="menu-item__currency"> $ </span>
      4
    </strong>
  </p>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')
for element in soup.find_all(text=True):
    element.extract()

print(soup.prettify())

In this case the output is:

<li class="menu-item">
 <p class="menu-item__heading">
 </p>
 <p>
 </p>
 <p class="menu-item__details menu-item__details--price">
  <strong>
   <span class="menu-item__currency">
   </span>
  </strong>
 </p>
</li>
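
For comparison, here is a minimal sketch of the same parser-based idea using only the standard library's html.parser, in case installing Beautiful Soup is not an option. This is a swapped-in technique, not the code above: TextStripper and strip_text are hypothetical names made up for this example, and the sketch also drops comments and doctype declarations because it only re-emits tags.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags as they were written and drop every text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the most recent start tag as it appeared in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is deliberately not overridden: the base class ignores text.


def strip_text(markup):
    """Return markup with all text nodes removed (hypothetical helper)."""
    parser = TextStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_text('<p class="x">Totopos</p><strong><span> $ </span> 4 </strong>'))
```

Whitespace between tags counts as text here too, so the result comes out on a single line with no indentation.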

Answered By: godot

Answer 2

Attempting to manipulate HTML using regular expressions is almost never the best idea, but this regex should do the trick for you:

print(re.sub(r">[^<]+<", "><", htmlmin.minify(html)))
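
To see what this pattern does and does not touch, here is a small sketch that needs only the re module. The input is minified by hand as a stand-in for htmlmin.minify(html), and strip_text is a hypothetical helper name used just for this example.

```python
import re

# Hand-minified stand-in for htmlmin.minify(html); an assumption for this sketch.
minified = (
    '<li class="menu-item"><p class="menu-item__heading">Totopos</p>'
    '<p>Chips and molcajete salsa</p><strong>'
    '<span class="menu-item__currency"> $ </span> 4 </strong></li>'
)


def strip_text(markup):
    """Drop every run of characters that sits between '>' and the next '<'."""
    return re.sub(r">[^<]+<", "><", markup)


print(strip_text(minified))

# Caveat: the pattern knows nothing about HTML, so a '>' inside an
# attribute value is treated as the end of a tag and the attribute is cut.
print(strip_text('<a title="a > b">link</a>'))
```

The second call shows why a parser is the safer choice: the attribute value gets truncated at the ">" it contains.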

Answered By: Matias Cicero