Remove HTML tags not on an allowed list from a Python string

Question

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags, how can I process the string to remove any other tags?

Asked By: Everett Toews

||

Source

Answer 1

Here’s a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    return soup.renderContents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.

You might also look into using lxml and Tidy.

Answered By: bryan

Answer 2

The above solutions via Beautiful Soup will not work. You might be able to hack something with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. In a while, I think I’ll try to solve the problem properly, but it’s a week-long project or so, and I don’t have a free week soon.

Just to be specific, not only will Beautiful Soup throw exceptions for some parsing errors which the above code doesn’t catch; but also, there are plenty of very real XSS vulnerabilities that aren’t caught, like:

<<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing that you can do is instead to strip out the < element as <, to prohibit all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can also go back and re-introduce common bits of HTML with a regex. Here’s what the process looks like, roughly:

_lt_     = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.     
_ok_ = re.compile(_tc_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_ + 'sqrt>', re.I)     #just to give an example of extending
_endsqrt_ = re.compile(_tc_ + '/sqrt>', re.I) #html syntax with your own elements.
_tcre_ = re.compile(_tc_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)
    text = markdown(text)
    text = _ok_.sub(r'<1>', text)
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)

I haven’t tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the ok stuff.

Answered By: Chris Drost

Answer 3

Here is what i use in my own project. The acceptable_elements/attributes come from feedparser and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags and tricky can could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we cant exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e

Answered By: Jochen Ritzel

Answer 4

Use lxml.html.clean! It’s VERY easy!

from lxml.html.clean import clean_html
print clean_html(html)

Suppose the following html:

html = '''
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The results…

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize the elements you want to clean and whatnot.

Answered By: nosklo

Answer 5

I prefer the lxml.html.clean solution, like nosklo points out. Here’s to also remove some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    clean_html = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(clean_html)
    context = etree.iterwalk(root) # just the end tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' trn')
        if elem.tag in nonempty_tags and 
        not (len(elem) or clean_text): # no children nor text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text # if you want
        # and if you also wanna remove some attrs:
        for badattr in remove_attrs:
            if elem.attrib.has_key(badattr):
                del elem.attrib[badattr]
    return tostring(root)

Answered By: ducu

Answer 6

I modified Bryan‘s solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but does the job:

from BeautifulSoup import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value)
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.renderContents()
    while 1:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput)
        for tag in soup.findAll(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = [(attr, value) for attr, value in tag.attrs if attr in valid_tags[tag.name]]
        newoutput = soup.renderContents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.

Answered By: Kiran Jonnalagadda

Answer 7

I use FilterHTML. It’s simple and lets you define a well-controlled white-list, scrubs URLs and even matches attribute values against regex or have custom filtering functions per attribute. If used carefully it could be a safe solution. Here’s a simplified example from the readme:

import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
whitelist = {
  'a': {
    'href': 'url',
    'target': [
      '_blank',
      '_self'
    ],
    'class': [
      'button'
    ]
  },
  'img': {
    'src': 'url',
    'width': 'measurement',
    'height': 'measurement'
  },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)

Answered By: codedvillain

Answer 8

You could use html5lib, which uses a whitelist to sanitize.

An example:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values=True)
    return s.render(stream)

Answered By: Brian Neal

Answer 9

Bleach does better with more useful options. It’s built on html5lib and ready for production. Check the documentation for the bleach.clean function. Its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>.

import bleach
bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'

Answered By: chuangbo

Remove HTML tags not on an allowed list from a Python string

Question:

Answers: