Writing UTF-8 String to MySQL with Python

Question:

I am trying to push user account data from an Active Directory to our MySQL-Server. This works flawlessly but somehow the strings end up showing an encoded version of umlauts and other special characters.

The Active Directory returns a string using this sample format: M\xc3\xbcller

This actually is the UTF-8 encoding for Müller, but I want to write Müller to my database, not M\xc3\xbcller.

I tried converting the string with this line, but it results in the same string in the database:
tempEntry[1] = tempEntry[1].decode("utf-8")

If I run print "M\xc3\xbcller".decode("utf-8") in the Python console, the output is correct.

Is there any way to insert this string the right way? I need this specific format for a web developer who wants to have this exact format, I don’t know why he is not able to convert the string using PHP directly.

Additional info: I am using MySQLdb; The table and column encoding is utf8_general_ci

Asked By: Raptor

Answers:

Assuming you are using MySQLdb, you need to pass use_unicode=True and charset="utf8" when creating your connection.

UPDATE:
If I run the following against a test table I get –

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
>>> c = db.cursor()
>>> c.execute("INSERT INTO last_names VALUES(%s)", (u'M\xfcller', ))
1L
>>> c.execute("SELECT * FROM last_names")
1L
>>> print c.fetchall()
(('M\xc3\xbcller',),)

This is “the right way”: the characters are being stored and retrieved correctly. Your friend writing the PHP script just isn’t handling the encoding correctly when outputting.

As Rob points out, passing both use_unicode and charset is redundant, but I have a natural paranoia about even the most useful Python libraries outside of the standard library, so I try to be explicit to make bugs easy to find if the library changes.

Answered By: marr75

As @marr75 suggests, make sure you set charset='utf8' on your connections. Setting use_unicode=True is not strictly necessary as it is implied by setting the charset.

Then make sure you are passing unicode objects to your db connection, as it will encode them using the charset you passed when connecting. If you pass a utf8-encoded byte string instead, it will be doubly encoded by the time it reaches the database.

So, something like:

conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
data_from_ldap = 'M\xc3\xbcller'
name = data_from_ldap.decode('utf8')
cursor = conn.cursor()
cursor.execute(u"INSERT INTO mytable SET name = %s", (name,))
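The double encoding mentioned above can be reproduced without a database. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the variable names are illustrative, and Latin-1 is assumed to be the charset under which the already-encoded bytes get misread.

```python
# Text as it should end up in the database.
name = "Müller"

# Correct: encode exactly once when sending to the server.
encoded_once = name.encode("utf-8")

# Mistake: the already-encoded bytes are treated as Latin-1 text
# and encoded to UTF-8 a second time.
encoded_twice = encoded_once.decode("latin-1").encode("utf-8")

print(encoded_once)   # b'M\xc3\xbcller'
print(encoded_twice)  # b'M\xc3\x83\xc2\xbcller'

# Reading the doubly encoded bytes back as UTF-8 yields mojibake.
print(encoded_twice.decode("utf-8"))  # MÃ¼ller
```

Passing decoded text to the driver, as in the snippet above, means the encoding step happens only once.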

You may also try forcing the connection to use utf8 by passing the init_command param, though I’m unsure if this is required. Five minutes of testing should help you decide.

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

Also, and this is barely worth mentioning as 4.1 is so old, make sure you are using MySQL >= 4.1

Answered By: Rob Cowie

I found the solution to my problem. Decoding the string with .decode('unicode_escape').encode('iso8859-1').decode('utf8') finally worked. Now everything is inserted as it should be. The full solution can be found here: Working with unicode encoded Strings from Active Directory via python-ldap
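The decode chain in this answer can be sketched as follows, in Python 3 syntax (the original is Python 2). It assumes the value from the directory is ASCII text containing literal backslash escapes (the characters `\`, `x`, `c`, `3`, and so on) rather than real UTF-8 bytes. The helper name `unescape_utf8` is made up for illustration.

```python
def unescape_utf8(raw):
    """Turn text holding literal '\\xNN' escapes of UTF-8 bytes into real text."""
    # 1. Interpret the escape sequences: '\\xc3' becomes the character U+00C3.
    unescaped = raw.encode("ascii").decode("unicode_escape")
    # 2. Map each character U+0000..U+00FF back to the byte with the same value.
    as_bytes = unescaped.encode("iso8859-1")
    # 3. Decode those bytes as the UTF-8 they really are.
    return as_bytes.decode("utf8")


print(unescape_utf8(r"M\xc3\xbcller"))  # Müller
```

The round trip through iso8859-1 works because that charset maps every code point below 256 to the byte with the same numeric value.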

Answered By: Raptor

And does db.set_character_set('utf8') imply use_unicode=True?

Answered By: Sérgio

(Would like to reply to above answer but do not have enough reputation…)

The reason why you don’t get unicode results in this case:

>>> print c.fetchall()
(('M\xc3\xbcller',),)

is a bug in MySQLdb 1.2.x with *_bin collations, see:

http://sourceforge.net/tracker/index.php?func=detail&aid=1693363&group_id=22307&atid=374932
http://sourceforge.net/tracker/index.php?func=detail&aid=2663436&group_id=22307&atid=374932

In this particular case (collation utf8_bin – or [anything]_bin…) you have to expect the “raw” value, here utf-8 (yes, this sucks as there is no generic fix).

Answered By: lacorbeille

import MySQLdb

# connect to the database
db = MySQLdb.connect("****", "****", "****", "****")  # don't pass charset here

# set up a cursor object using the cursor() method
cursor = db.cursor()

cursor.execute("SET NAMES utf8mb4;")  # or utf8, or any other charset you want to handle

cursor.execute("SET CHARACTER SET utf8mb4;")  # same as above

cursor.execute("SET character_set_connection=utf8mb4;")  # same as above

# run a SQL query
cursor.execute("****")

# and make sure the MySQL settings are correct, and the data too

Answered By: YEH

Recently I had the same issue with field value being a byte string instead of unicode. Here’s a little analysis.

Overview

In general, all one needs to do to get unicode values from a cursor is to pass the charset argument to the connection constructor and use non-binary table fields (e.g. utf8_general_ci). Passing use_unicode is redundant because it is set to true whenever charset has a value.

MySQLdb respects the cursor description field types, so if you have a DATETIME column in the cursor, the values will be converted to Python datetime.datetime instances, DECIMAL to decimal.Decimal and so on, but binary values will be represented as is, as byte strings. Most of the decoders are defined in MySQLdb.converters, and one can override them on a per-instance basis by providing the conv argument to the connection constructor.

But the unicode decoders are an exception here, which is likely a design shortcoming. They are appended directly to the connection instance's converters in its constructor, so it is only possible to override them on a per-instance basis.

Workaround

Let’s look at code that reproduces the issue.

import MySQLdb

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
cursor     = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone() 
# (u'abcd\u0451', 'abcd\xd1\x91')
print cursor.description 
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags 
# (1, 0)

It shows that the b field is returned as a byte string instead of unicode. However, it is not binary: MySQLdb.constants.FLAG.BINARY & cursor.description_flags[1] is 0 (see MySQLdb field flags). This looks like a bug in the library (opened as #90). The reason, as I see it, is that MySQLdb.constants.FIELD_TYPE.LONG_BLOB (cursor.description[1][1] == 251, see MySQLdb field types) simply has no converter at all.

import MySQLdb
import MySQLdb.converters as conv
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
connection.converter[const.FIELD_TYPE.LONG_BLOB] = connection.converter[const.FIELD_TYPE.BLOB]
cursor = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone()
# (u'abcd\u0451', u'abcd\u0451')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)

Thus, by manipulating the connection instance's converter dict, it is possible to achieve the desired unicode decoding behaviour.

If you want to override the behaviour, here is what a dict entry for a possible text field looks like after the constructor has run.

import MySQLdb
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
print connection.converter[const.FIELD_TYPE.BLOB]
# [(128, <type 'str'>), (None, <function string_decoder at 0x7fa472dda488>)]

MySQLdb.constants.FLAG.BINARY == 128. This means that if a field has the binary flag it will be returned as str; otherwise the unicode decoder will be applied. So if you want to try to convert binary values as well, you can pop the first tuple.
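The selection rule described here can be mimicked in plain Python. This is only a sketch of the behaviour as explained above, not MySQLdb's actual code: `pick_converter` and `utf8_decoder` are hypothetical names, and Python 3's `bytes`/`str` stand in for Python 2's `str`/`unicode`.

```python
BINARY_FLAG = 128  # value of MySQLdb.constants.FLAG.BINARY quoted above


def utf8_decoder(raw):
    # Stand-in for the string_decoder that the connection installs.
    return raw.decode("utf-8")


def pick_converter(entries, field_flags):
    """Return the first converter whose mask matches the field flags."""
    for mask, func in entries:
        # A mask of None acts as the fallback; otherwise test the flag bits.
        if mask is None or field_flags & mask:
            return func
    return None


entries = [(BINARY_FLAG, bytes), (None, utf8_decoder)]
raw = "abcdё".encode("utf-8")

print(pick_converter(entries, 0)(raw))            # text field: decoded
print(pick_converter(entries, BINARY_FLAG)(raw))  # binary field: bytes kept as is

# Popping the first tuple makes the decoder apply to binary fields too.
entries.pop(0)
print(pick_converter(entries, BINARY_FLAG)(raw))
```

With the first tuple removed, only the fallback entry remains, so every field of that type goes through the decoder regardless of its flags.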

Answered By: saaj

There is another situation, though it may be a little rare.

If you create the schema in MySQL Workbench first, you will get the encoding error and cannot solve it by adding the charset configuration.

This is because MySQL Workbench creates the schema with latin1 by default, so you should set the charset first!

Answered By: dogewang