Printing a utf-8 encoded string
Question:
I’m using BeautifulSoup to extract some text from an HTML page, but I can’t figure out how to print it properly to the screen (or to a file, for that matter).
Here’s what my class containing the text looks like:
class Thread(object):
    def __init__(self, title, author, date, content=u""):
        self.title = title
        self.author = author
        self.date = date
        self.content = content
        self.replies = []

    def __unicode__(self):
        s = u""
        for k, v in self.__dict__.items():
            s += u"%s = %s " % (k, v)
        return s

    def __repr__(self):
        return repr(unicode(self))

    __str__ = __repr__
When I try to print an instance of Thread, here’s what I see on the console:
~/python-tests $ python test.py
u'date = 21:01 03/02/11 content = author = \u05d3"\u05e8 \u05d9\u05d5\u05e0\u05d9 \u05e1\u05d8\u05d0\u05e0\u05e6\'\u05e1\u05e7\u05d5 replies = [] title = \u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df '
Whatever I try, I cannot get the output I’d like (the above text should be Hebrew). My end goal is to serialize Thread to a file (using json or pickle) and be able to read it back.
I’m running this with Python 2.6.6 on Ubuntu 10.10.
Answers:
To output a Unicode string to a file (or the console) you need to choose a text encoding. In Python 2 the default encoding is ASCII, which cannot represent Hebrew characters, so you need to encode explicitly with a different encoding, such as UTF-8:
s = unicode(your_object).encode('utf8')
f.write(s)
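As a rough illustration of the advice above, here is a minimal sketch that writes the same Unicode text in two ways: encoding by hand and writing bytes, and letting io.open do the encoding. The sample text, the temporary directory, and the file names are assumptions made up for this example; they are not part of the original question.

```python
import io
import os
import tempfile

# Sample Hebrew text, written with escapes so this source stays pure ASCII.
text = u"title = \u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df"

# Hypothetical output location, used only for this example.
out_dir = tempfile.mkdtemp()

# Option 1: encode by hand and write the resulting bytes in binary mode.
bytes_path = os.path.join(out_dir, "thread_bytes.txt")
with open(bytes_path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let io.open encode on write, so no explicit .encode() is needed.
text_path = os.path.join(out_dir, "thread_text.txt")
with io.open(text_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare what ended up on disk.
with open(bytes_path, "rb") as f:
    raw_manual = f.read()
with open(text_path, "rb") as f:
    raw_io = f.read()

# Decoding with the same encoding recovers the original Unicode string.
restored = raw_manual.decode("utf-8")
```

Both approaches should produce identical bytes on disk. The io.open form is often more convenient because the encoding is declared once, when the file is opened.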
A nice alternative to @mark’s answer is to set the environment variable PYTHONIOENCODING=UTF-8.
Cf. Writing unicode strings via sys.stdout in Python.
(Make sure to set it before starting Python, not inside the script.)
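To see the effect of the variable without changing your shell configuration, one option is to launch a child interpreter with PYTHONIOENCODING set in its environment and inspect the bytes it prints. This is only a sketch: the child program and the variable names are hypothetical, and it assumes sys.executable points at a working Python interpreter.

```python
import os
import subprocess
import sys

# Copy the current environment and set the variable for the child process only.
# It has to be in place before the child interpreter starts.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")

# Hypothetical child program: print a Hebrew word given as escapes.
child_code = 'print(u"\\u05de\\u05d1\\u05d7\\u05df")'

# Run the child with its stdout captured through a pipe.
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
)
raw, _ = proc.communicate()

# The captured bytes should be UTF-8, so they decode back to the Hebrew text.
decoded = raw.decode("utf-8").strip()
```

In a shell, the equivalent would be to prefix the command, for example PYTHONIOENCODING=UTF-8 python test.py.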