UnicodeEncodeError when writing to a file

Question:

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).

I can use “print” to display them, but when I use file.write() I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)

How can I parse this?

Asked By: Ivy

Answers:

This error occurs when you pass a Unicode string containing non-ASCII characters (code points above 127) to something that expects an ASCII bytestring. In Python 2, the default encoding used for this implicit conversion is ASCII, “which handles exactly 128 (English) characters”. This is why trying to convert characters with code points above 127 produces the error.

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.

The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.

Going the other way, from Unicode to bytes, you choose the encoding explicitly. For example,

s = u'La Pe\xf1a'
print s.encode('latin-1')

or

write(s.encode('latin-1'))

will encode the string using latin-1.
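
The same idea can be sketched in Python 3, where every str is already Unicode and the u'' prefix and the print statement are no longer needed. This is only an illustration under that assumption; the variable names are made up for the example.

```python
# Python 3 sketch (assumption: the answer above targets Python 2).
s = 'La Pe\xf1a'  # the same text as above: 'La Peña'

# ASCII covers only code points 0-127, so this reproduces the error.
try:
    s.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

latin1_bytes = s.encode('latin-1')  # one byte per character
utf8_bytes = s.encode('utf-8')      # 'ñ' becomes two bytes

# Lossy fallback: unencodable characters are replaced with '?'.
replaced = s.encode('ascii', errors='replace')
```

Choosing an encoding that can represent the character (latin-1 or UTF-8 here) avoids the error; errors='replace' also avoids it, but at the cost of losing the original character.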

Answered By: yossi

If I type ‘python unicode’ into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what’s going on.

You really do need to read and understand these sorts of overviews, however long they seem. There really isn’t any getting around it. Text is hard. There is no such thing as “plain text”, there hasn’t been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.

You also should read http://www.joelonsoftware.com/articles/Unicode.html .

Answered By: Karl Knechtel

The answer to your question is “use codecs”. The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization

import codecs
import gettext

import wx  # wxPython; provides wx.Locale and wx.LANGUAGE_DEFAULT

localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)

translater = gettext.translation(domain, localedir,
                                 [mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)

# translater.install() installs the gettext _() translater function into our namespace...

msg = _("A message that gettext will translate, probably putting Unicode in here")

# use codecs.open() to convert Unicode strings to UTF8
# (logfile_name is the path of the log file, defined elsewhere)

Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')

Logfile.write(msg + '\n')

Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
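
Stripped of the wxPython and gettext parts, the codecs.open() step might look like the sketch below. It is written for Python 3 as an assumption, and the temporary log path and message text are invented for the example. Note that recent Python 3 releases steer users toward the built-in open() instead of codecs.open().

```python
import codecs
import os
import tempfile

# Contains U+00A3 (pound sign), the character from the error in the question.
msg = 'Price: \xa3100'

# Hypothetical log file in a temporary directory, for illustration only.
log_path = os.path.join(tempfile.mkdtemp(), 'example.log')

# codecs.open() returns a file object that encodes text as it is written.
with codecs.open(log_path, 'w', encoding='utf-8') as logfile:
    logfile.write(msg + '\n')

# Read the raw bytes back to see what was actually stored on disk.
with open(log_path, 'rb') as raw:
    data = raw.read()
```

The pound sign ends up as the two UTF-8 bytes 0xC2 0xA3 in the file, so no ASCII conversion is ever attempted.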

So … HTH…

GaJ

Answered By: GreenAsJade

I tried this, and it works fine:

with open(r"C:ragsampleoutput.txt", 'w', encoding="utf-8") as f:
    f.write(text)  # text is a placeholder name for the Unicode string to write
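
For a version that can be run anywhere, here is a minimal round-trip sketch using the built-in open() in Python 3. The temporary output path and the sample text are assumptions made for the example, not part of the answer above.

```python
import os
import tempfile

text = 'La Pe\xf1a costs \xa3100'  # sample text with two non-ASCII characters

# Hypothetical output path in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Passing encoding explicitly avoids depending on the platform default.
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Reading with the same encoding that was used for writing is what makes the round trip lossless.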
