Avoiding Python UnicodeDecodeError in Jinja's nl2br filter

Question:

I’m using Jinja2’s nl2br filter, which looks like:

import re
from jinja2 import evalcontextfilter, Markup, escape

_paragraph_re = re.compile(r'(?:\r\n|\r|\n){2,}')

@evalcontextfilter
def nl2br(eval_ctx, value):
    result = u'\n\n'.join(u'<p>%s</p>' % p.replace('\n', '<br>\n')
                          for p in _paragraph_re.split(escape(value)))
    if eval_ctx.autoescape:
        result = Markup(result)
    return result

The problem is that if “value” contains anything but ASCII characters (for example, “/mɒnˈtænə/”), the filter fails with this error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/app.py", line 889, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/app.py", line 879, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/app.py", line 876, in wsgi_app
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/app.py", line 695, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/mcrittenden/Dropbox/Code/dropdo/dropdo.py", line 105, in view
    return render_template(template, src = url, data = content)
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/templating.py", line 85, in render_template
    context, ctx.app)
  File "/usr/local/lib/python2.6/dist-packages/Flask-0.6.1-py2.6.egg/flask/templating.py", line 69, in _render
    rv = template.render(context)
  File "/usr/local/lib/python2.6/dist-packages/Jinja2-2.5.5-py2.6.egg/jinja2/environment.py", line 891, in render
    return self.environment.handle_exception(exc_info, True)
  File "/home/mcrittenden/Dropbox/Code/dropdo/templates/text.html", line 1, in top-level template code
    {% extends "layout.html" %}
  File "/home/mcrittenden/Dropbox/Code/dropdo/templates/layout.html", line 25, in top-level template code
    {% block content %}{% endblock %}
  File "/home/mcrittenden/Dropbox/Code/dropdo/templates/text.html", line 8, in block "content"
    {{ data|nl2br }}
  File "/home/mcrittenden/Dropbox/Code/dropdo/dropdo.py", line 26, in nl2br
    for p in _paragraph_re.split(escape(value)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 12: ordinal not in range(128)

What’s the best way to prevent the error without removing the problem characters altogether?

Asked By: Mike Crittenden


Answers:

Use unicode literals everywhere.

“Unicode in Python, Completely Demystified”

If “value” has anything but ascii characters, you want it to be Unicode, and nothing but Unicode, throughout your entire app, except for a few places where you explicitly encode or decode it. Pass Unicode to your templates, too.

If you acquire the string “/mɒnˈtænə/” somehow, you probably know its encoding; use it to decode the bytes:

value = "/mɒnˈtænə/".decode(the_encoding)
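
As a minimal sketch of that decode step, assuming the bytes are UTF-8 (the helper name to_text and the sample bytes are illustrative, not part of the original code), the value can be converted once at the boundary and passed to the template as Unicode. It is written so that it also runs on Python 3, where byte strings and text are separate types:

```python
# Bytes as they might arrive from a file or a socket. These are assumed to be
# the UTF-8 encoding of the pronunciation string from the question.
raw = b"/m\xc9\x92n\xcb\x88t\xc3\xa6n\xc9\x99/"


def to_text(value, encoding="utf-8"):
    """Return Unicode text; decode only when given a byte string.

    Hypothetical helper: the default encoding is an assumption and
    should be replaced by whatever the data source actually uses.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


value = to_text(raw)  # Unicode from here on; safe to hand to nl2br
```

Calling to_text on a value that is already Unicode returns it unchanged, so it is safe to apply wherever data enters the app.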

How do you learn the encoding? An HTTP request usually declares its encoding in the Content-Type header. An XML file declares it in its XML declaration. A plain text file usually does not; you must know its encoding by some other means.
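
For the plain text file case, one option is to name the encoding when opening the file, so that reading already yields Unicode. The sketch below assumes the file is UTF-8 and uses a temporary file with a made-up name so that it is self-contained; io.open behaves the same on Python 2.6+ and Python 3:

```python
import io
import os
import tempfile

# Create a small UTF-8 file so the example can run on its own.
# The file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"/m\u0252n\u02c8t\u00e6n\u0259/\n")

# Name the encoding explicitly; read() then returns Unicode text,
# not raw bytes, so no implicit ASCII decoding happens later.
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()
```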

Note that UTF-8 is not Unicode, though it is an encoding that can fully represent Unicode. It’s still an encoding, and to get a Python Unicode string from UTF-8 bytes, you need to call .decode("utf-8") on them.
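
The difference between the two can be seen in a short sketch that uses the string from the question: the Unicode text and its UTF-8 bytes have different lengths, and decoding the bytes as ASCII reproduces the kind of error shown in the traceback.

```python
text = u"/m\u0252n\u02c8t\u00e6n\u0259/"  # Unicode text: 10 code points
data = text.encode("utf-8")                # UTF-8 bytes: 14 bytes

# Decoding with the right codec round-trips to the original text.
round_trip = data.decode("utf-8")

# Decoding with the wrong codec is what raises UnicodeDecodeError.
try:
    data.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte ASCII cannot decode
```

In Python 2 the same failure happens implicitly whenever a byte string containing non-ASCII bytes is mixed with a Unicode string, which is what the u'...' literals in the filter trigger.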

Answered By: 9000

Try unidecode from http://pypi.python.org/pypi/Unidecode

>>> from unidecode import unidecode
>>> m = u'My fianc\xe9 David'; print m; print unidecode(m)
My fiancé David
My fiance David
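
If installing a third-party package is not an option, a rough approximation can be sketched with the standard library. This is not how unidecode works (unidecode uses transliteration tables); it only decomposes accented letters and drops whatever is still non-ASCII. The function name strip_accents is illustrative:

```python
import unicodedata


def strip_accents(text):
    """Rough ASCII approximation of Unicode text.

    NFKD splits letters such as e-acute into a base letter plus a
    combining mark; encoding with "ignore" then drops the marks and
    any other character that has no ASCII form.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


simple = strip_accents(u"My fianc\xe9 David")   # "My fiance David"
lossy = strip_accents(u"/m\u0252n\u02c8t\u00e6n\u0259/")
```

Letters that have no decomposition, such as ɒ or ə, are simply dropped by this approach, so it removes the problem characters rather than preserving them.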
Answered By: JohnMudd