UTF-8 In Python logging, how?

Question:

I’m trying to log a UTF-8 encoded string to a file using Python’s logging package. As a toy example:

import logging

def logging_test():
    handler = logging.FileHandler("/home/ted/logfile.txt", "w",
                                  encoding = "UTF-8")
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Explode
    root_logger.info(unicode_string)

if __name__ == "__main__":
    logging_test()

This explodes with UnicodeDecodeError on the logging.info() call.

At a lower level, Python’s logging package is using the codecs package to open the log file, passing in the “UTF-8” argument as the encoding. That’s all well and good, but it’s trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:

file_handler.write(unicode_string.encode("UTF-8"))

When it should be doing this:

file_handler.write(unicode_string)

Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.

Asked By: Ted Dziuba

Answers:

Try this:

import logging

def logging_test():
    log = open("./logfile.txt", "w")
    handler = logging.StreamHandler(log)
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Log the UTF-8 encoded bytes rather than the unicode object
    root_logger.info(unicode_string.encode("utf8", "replace"))


if __name__ == "__main__":
    logging_test()

For what it’s worth, I was expecting to have to use codecs.open to open the file with UTF-8 encoding. Either that’s the default or something else is going on here, because it works as written.
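As a rough illustration of the other way to handle this, the sketch below opens the stream with an explicit encoding and logs the text unencoded, so the stream does the encoding. It is written for Python 3 and is not the answer's original code; the helper `log_to_utf8_stream` and the logger name `utf8_stream_demo` are hypothetical names made up for the example.

```python
import io
import logging
import os
import tempfile


def log_to_utf8_stream(path, message):
    """Log one message through a StreamHandler wrapping a UTF-8 text file."""
    # io.open returns a text stream that encodes str to UTF-8 on write.
    stream = io.open(path, "w", encoding="utf-8")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))

    # A named logger (hypothetical name) leaves the root logger untouched.
    logger = logging.getLogger("utf8_stream_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
        handler.close()
        stream.close()


demo_path = os.path.join(tempfile.mkdtemp(), "logfile.txt")
log_to_utf8_stream(demo_path, "\u00f4")  # o with a circumflex

# Read the raw bytes back to see what actually landed on disk.
with io.open(demo_path, "rb") as raw:
    logged_bytes = raw.read()
```

Here the message is never encoded by hand, which avoids mixing byte strings and text in the same handler.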

Answered By: John

Check that you have the latest Python 2.6 – some Unicode bugs were found and fixed since 2.6 came out. For example, on my Ubuntu Jaunty system, I ran your script copied and pasted, removing only the ‘/home/ted/’ prefix from the log file name. Result (copied and pasted from a terminal window):

vinay@eta-jaunty:~/projects/scratch$ python --version
Python 2.6.2
vinay@eta-jaunty:~/projects/scratch$ python utest.py 
printed unicode object: ô
vinay@eta-jaunty:~/projects/scratch$ cat logfile.txt 
ô
vinay@eta-jaunty:~/projects/scratch$ 

On a Windows box:

C:\temp>python --version
Python 2.6.2

C:\temp>python utest.py
printed unicode object: ô

And the contents of the file:

[image of the log file contents; not preserved]

This might also explain why Lennart Regebro couldn’t reproduce it either.

Answered By: Vinay Sajip

If I understood your problem correctly, the same issue should arise on your system when you do just:

str(u'ô')

I guess automatic encoding to the locale encoding on Unix will not work until you enable the locale-aware if branch in the setencoding function of your site module, which uses the locale module. That file usually resides in /usr/lib/python2.x and is worth inspecting anyway. AFAIK, locale-aware setencoding is disabled by default (that is true for my Python 2.6 installation).

The choices are:

  • Let the system figure out the right way to encode Unicode strings to bytes or do it in your code (some configuration in site-specific site.py is needed)
  • Encode Unicode strings in your code and output just bytes

See also The Illusive setdefaultencoding by Ian Bicking and related links.
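The second choice, encoding explicitly in your own code, can be sketched as follows. This is a Python 3 illustration rather than the Python 2 code discussed above: it uses an explicit encode to ASCII as a stand-in for the implicit ASCII conversion that Python 2 performs in str(u'ô'). The variable names are made up for the example.

```python
text = "\u00f4"  # o with a circumflex

# Encoding explicitly to UTF-8 always succeeds for valid text.
utf8_bytes = text.encode("utf-8")

# Encoding to ASCII fails, which mirrors what happens when a non-ASCII
# character meets an ASCII default encoding.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The "replace" error handler substitutes "?" instead of raising.
ascii_replaced = text.encode("ascii", "replace")
```

Choosing the encoding at the point of output keeps the result independent of the interpreter's default.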

Answered By: Andrey Vlasovskikh

Having code like:

raise Exception(u'щ')

Caused:

  File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
    s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:

>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)

Making the format string unicode fixes the issue:

>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'

So, in your logging configuration, make all format strings unicode:

'formatters': {
    'simple': {
        'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
    },
 ...

And patch the default logging formatter to use a unicode format string:

logging._defaultFormatter = logging.Formatter(u"%(message)s")
Answered By: warvariuc

I had a similar problem running Django in Python3: My logger died upon encountering some Umlauts (äöüß) but was otherwise fine. I looked through a lot of results and found none working. I tried

import locale

if locale.getpreferredencoding().upper() != 'UTF-8':
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

which I got from the comment above.
It did not work. Looking at the current locale gave me some crazy ANSI thing, which turned out to mean basically just “ASCII”. That sent me in totally the wrong direction.

Changing the logging format strings to Unicode did not help.
Setting a magic encoding comment at the beginning of the script did not help.
Setting the charset on the sender’s message (the text came from an HTTP request) did not help.

What DID work was setting the encoding on the file handler to UTF-8 in settings.py. Because I had nothing set, the encoding defaulted to None, which apparently ends up being ASCII (or, as I like to think of it: ASS-KEY).

    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'encoding': 'UTF-8', # <-- That was missing.
            ....
        },
    },
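A fuller version of such a configuration might look like the sketch below. It assumes Python 3 and plain logging.config.dictConfig rather than Django's settings.py; the log path, the formatter, and the logger name `umlaut_demo` are hypothetical values chosen for the example.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical log location in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(levelname)s %(message)s"},
    },
    "handlers": {
        "file": {
            "level": "DEBUG",
            "class": "logging.handlers.TimedRotatingFileHandler",
            "filename": log_path,
            "formatter": "simple",
            "encoding": "utf-8",  # the setting that was missing
        },
    },
    "loggers": {
        "umlaut_demo": {
            "handlers": ["file"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING)
logging.getLogger("umlaut_demo").info("\u00e4\u00f6\u00fc\u00df")  # umlauts and sharp s
logging.shutdown()  # flush and close the file handler

# Read the file back with the same encoding to check the round trip.
with open(log_path, encoding="utf-8") as fh:
    logged_line = fh.read()
```

Extra keys in a handler entry, such as filename and encoding, are passed to the handler's constructor as keyword arguments, which is why the encoding setting takes effect.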
Answered By: Chris

I’m a little late, but I just came across this post, which enabled me to set up logging in UTF-8 very easily.

Here is the link to the post, or here is the code:

import logging

root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG)  # or whatever
handler = logging.FileHandler('test.log', 'w', 'utf-8')  # or whatever
formatter = logging.Formatter('%(name)s %(message)s')  # or whatever
handler.setFormatter(formatter)  # setFormatter is a method call, not an assignment
root_logger.addHandler(handler)
Answered By: Ephie