Python, .format(), and UTF-8

Question:

My background is in Perl, but I’m giving Python plus BeautifulSoup a try for a new project.

In this example, I’m trying to extract and present the link targets and link text contained in a single page. Here’s the source:

table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))

All those explicit calls to .encode('utf-8') are my attempt to make this work, but they don’t seem to help. It’s likely that I’m completely misunderstanding something about how Python 2.7 handles Unicode strings.

Anyway. This works fine up until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:

Traceback (most recent call last):
  File "./test2.py", line 30, in <module>
    line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)

Presumably .format(), even applied to a Unicode string, is playing silly-buggers and trying to do a .decode() operation. And as ASCII is the default, it’s using that, and of course it can’t map U+2013 to an ASCII character, and thus…

The options seem to be to remove it or convert it to something else, but really what I want is to simply preserve it. Ultimately (this is just a little test case) I need to be able to present working clickable links.

The BS3 documentation suggests changing the default encoding from ASCII to UTF-8, but from reading comments on similar questions, that looks to be a really bad idea as it’ll muck up dictionaries.

Short of using Python 3.2 instead (which means no Django, which we’re considering for part of this project) is there some way to make this work cleanly?

Asked By: Matt McLeod


Answers:

First, note that your two code samples disagree on the text of the problematic line:

line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))

vs

line_out = unicode(table_row.format(link_text, link_target))

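The first is the one from the traceback, so it’s the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte string, because you took a unicode string and encoded it. Byte strings can’t be encoded directly, so when you call .encode() on one, Python 2 first implicitly converts it from byte string to unicode by decoding it as ASCII. The same implicit decode happens anywhere Python 2 needs unicode from a byte string, and it fails on the first non-ASCII byte (0xe2 is the first byte of U+2013 in UTF-8). Hence the error message, "UnicodeDecodeError from ascii".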
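To make the failure concrete, here is a minimal sketch that performs by hand the ASCII decode Python 2 does behind the scenes. The URL is a made-up stand-in for whatever the real page contained, and the explicit .decode('ascii') call is only an illustration of the implicit step, not a line from the original script.

```python
# Hypothetical URL containing U+2013 (EN DASH), encoded to a UTF-8 byte string
link_target = u'http://example.com/a\u2013b'.encode('utf-8')

error = None
try:
    # Python 2 performs this decode implicitly when a byte string is
    # used where unicode is required; here it is spelled out explicitly.
    link_target.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc  # the 'ascii' codec cannot decode byte 0xe2

# The byte that triggered the failure
bad_byte = error.object[error.start:error.start + 1]
```

The exception carries the codec name and the offending byte, which is the same information the traceback in the question reports.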

You need to decide which strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.
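As a rough sketch of that discipline applied to the table row: keep the template and the values as unicode, and encode once when the result leaves the program. The link text and URL below are invented placeholders for what BeautifulSoup would return, and where you encode (file, socket, stdout) depends on your script.

```python
# Unicode template: note there is no .encode() here
table_row = u'<tr><td>{}</td><td>{}</td></tr>'

# Hypothetical values standing in for link.get_text() and link['href']
link_text = u'Example link'
link_target = u'http://example.com/a\u2013b'  # contains U+2013

# Formatting a unicode template with unicode arguments involves
# no implicit ASCII decode, so the en dash is preserved.
line_out = table_row.format(link_text, link_target)

# Encode exactly once, at the output boundary
encoded = line_out.encode('utf-8')
```

With this arrangement the U+2013 in the URL survives intact, and the only place an encoding is named is the final step.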

Here’s a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop The Pain?

Answered By: Ned Batchelder