Printing a utf-8 encoded string

Question:

I’m using BeautifulSoup to extract some text from an HTML document, but I can’t figure out how to print it properly to the screen (or to a file, for that matter).

Here’s what the class containing the text looks like:

class Thread(object):
    def __init__(self, title, author, date, content = u""):
        self.title = title
        self.author = author
        self.date = date
        self.content = content
        self.replies = []

    def __unicode__(self):
        s = u""

        for k, v in self.__dict__.items():
            s += u"%s = %s " % (k, v)

        return s

    def __repr__(self):
        return repr(unicode(self))

    __str__ = __repr__

When trying to print an instance of Thread here’s what I see on the console:

~/python-tests $ python test.py
u'date = 21:01 03/02/11 content =  author = \u05d3"\u05e8 \u05d9\u05d5\u05e0\u05d9 \u05e1\u05d8\u05d0\u05e0\u05e6\'\u05e1\u05e7\u05d5 replies = [] title = \u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df '

Whatever I try, I cannot get the output I’d like (the text above should be in Hebrew). My end goal is to serialize Thread to a file (using json or pickle) and be able to read it back.

I’m running this with Python 2.6.6 on Ubuntu 10.10.

Asked By: daniel


Answers:

To output a Unicode string to a file (or the console) you need to choose a text encoding. In Python 2 the default encoding is ASCII, which cannot represent Hebrew characters, so you need to encode the string explicitly with a different encoding, such as UTF-8:

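# Encode the Unicode text to UTF-8 bytes before writing.
s = unicode(your_object).encode('utf8')
# f is a file object already opened for writing, e.g. open(path, 'wb').
f.write(s)

Answered By: Mark Byers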
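
As a rough illustration of the encoding step described above, the following sketch writes Hebrew text to a file as UTF-8 and reads it back. The sample text, the temporary path and the variable names are assumptions made for the example and are not part of the original question; io.open is used so that the same code should behave alike on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample text (the title shown in the question's output).
text = u"\u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df"

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "thread.txt")

# Option 1: encode by hand and write the resulting bytes.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let io.open perform the encoding on every write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with the same encoding yields the original Unicode text.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

Both options put the same bytes on disk; the second simply moves the encode call into the file object, which can be easier to keep consistent when many writes are involved.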

A nice alternative to @mark’s answer is to set the environment variable PYTHONIOENCODING=UTF-8.

See also: Writing unicode strings via sys.stdout in Python.

(Make sure to set it before starting Python, not from inside the script.)

Answered By: Nir Levy
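
To see the effect of PYTHONIOENCODING without changing the shell configuration, one possible sketch is to start a child interpreter with the variable already set in its environment and capture what it writes. The child script and the sample word below are assumptions made for the example; the variable is placed in the child's environment before the child starts, which matches the advice to set it prior to launching Python.

```python
import os
import subprocess
import sys

# Hypothetical child script: writes a Hebrew word to its standard output.
child_code = "import sys; sys.stdout.write(u'\\u05e9\\u05dc\\u05d5\\u05dd')"

# Copy the current environment and set the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()

# The captured bytes are UTF-8, so decoding them recovers the text.
decoded = out.decode("utf-8")
```

From a shell, the equivalent is to prefix the command with the variable assignment when launching the script.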