How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Question:
I’m using Python and Django, but I’m having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in the future, utf8 might support them as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I’m limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python’s own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don’t want to actually encode the string; I want to still have a unicode string after filtering.
I DON’T want to escape the characters before storing them in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and infeasible.
See also:
- "Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
- ‘ ’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
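For reference, in Python 3 (where every str is already unicode) the behavior the question asks for can be sketched in a couple of lines: characters at or below U+FFFF are exactly the ones that fit in 3 UTF-8 bytes, and everything above gets the replacement character. The function name is mine, purely illustrative:

```python
def replace_4byte(s):
    """Replace every character outside the BMP (i.e. anything that would
    take 4 bytes in UTF-8) with U+FFFD, keeping the result a str."""
    return ''.join(c if ord(c) <= 0xFFFF else '\ufffd' for c in s)

replace_4byte('a\U0001F600b')  # 'a\ufffdb'
```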
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size))

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
- filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds in the sub() built-in)
- filter_using_python() did 2,097,923 function calls in 3.413 CPU seconds (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)
- I did no test using itertools because… well… that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Answers:
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for the surrogates that UTF-16 uses to encode characters outside the BMP. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)
Edit adding Python from Denilson Sá’s script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
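A Python 3 port of the same regex approach, as a sketch (the names mirror the Python 2 snippet above). In Python 3, well-formed str values cannot normally contain lone surrogates, so rejecting everything above the BMP is sufficient:

```python
import re

# Reject anything outside U+0000-U+D7FF and U+E000-U+FFFF
re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_using_re(unicode_string):
    return re_pattern.sub('\uFFFD', unicode_string)

filter_using_re('ok \U0001F600')  # 'ok \ufffd'
```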
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can’t encode after joining, since the surrogate pairs may be decoded before reencoding.
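The same encode-via-UTF-16 trick can be reproduced in Python 3, with the caveat that Python 3 refuses to UTF-8-encode lone surrogates unless you pass the 'surrogatepass' error handler. A sketch (the helper name is mine):

```python
import struct

def to_cesu8(s):
    # Encode as UTF-16LE so non-BMP characters become surrogate pairs,
    # then encode each 16-bit code unit as UTF-8 on its own.
    e = s.encode('utf-16le')
    units = struct.unpack('<%dH' % (len(e) // 2), e)
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)

to_cesu8('\U0001d41f')  # b'\xed\xa0\xb5\xed\xb0\x9f'
```

This produces CESU-8-style byte sequences which, as a later answer points out, the Unicode standard considers ill-formed UTF-8.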
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
I’m guessing it’s not the fastest, but it’s quite straightforward (“pythonic” 🙂):
def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the range U+D800-U+DFFF.
And just for the fun of it, an itertools monstrosity 🙂
import itertools as it, operator as op, functools as ft

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is the ordinal greater than 65535 (i.e. outside the BMP)?
    selector = ft.partial(op.lt, 65535)
    # using the character ordinals, produce False or True based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    # (False -> the original char, True -> the replacement char)
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
According to the MySQL 5.1 documentation: “The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP.” This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2, chapter 3, actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence; see for example page 93: “Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.” However, this proscription is, as far as I know, largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any “nasties” with u'\ufffd':
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
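In Python 3 the same check and replacement carry over almost verbatim; these are sketches of the two snippets above (the function names are mine), assuming the input is a str:

```python
def is_mysql_utf8_safe(s):
    # True when every character fits in at most 3 UTF-8 bytes
    # and is not a surrogate (normally impossible in a Python 3 str)
    return all(c < '\ud800' or '\ue000' <= c <= '\uffff' for c in s)

def replace_nasties(s):
    return ''.join(
        c if c < '\ud800' or '\ue000' <= c <= '\uffff' else '\ufffd'
        for c in s)
```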
You may skip the decoding and encoding steps and directly inspect the value of the first byte of each character in the (8-bit) UTF-8 string. According to UTF-8:
- 1-byte characters have the format 0xxxxxxx
- 2-byte characters have the format 110xxxxx 10xxxxxx
- 3-byte characters have the format 1110xxxx 10xxxxxx 10xxxxxx
- 4-byte characters have the format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        if k <= 127:
            # this is a 1-byte character, skip to the next byte
            i += 1
        elif k < 224:
            # this is a 2-byte character, skip ahead by 2 bytes
            i += 2
        elif k < 240:
            # this is a 3-byte character, skip ahead by 3 bytes
            i += 3
        else:
            # this is a 4-byte character, remove it and update
            # the length of the string we need to check
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time, and for smaller strings that mostly contain 1-byte characters this could even be faster than the regular-expression filtering.
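A Python 3 sketch of the same byte-level idea, operating on UTF-8 bytes and assuming the input is well-formed (the function name mirrors the answer’s; the bytearray-based rewrite is mine):

```python
def filter_4byte_chars(data):
    # Walk the UTF-8 byte string, using each lead byte to decide the
    # sequence length, and copy everything except 4-byte sequences.
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:        # 1-byte (ASCII) character
            step, keep = 1, True
        elif lead < 0xE0:      # 2-byte character
            step, keep = 2, True
        elif lead < 0xF0:      # 3-byte character
            step, keep = 3, True
        else:                  # 4-byte character: drop it
            step, keep = 4, False
        if keep:
            out += data[i:i + step]
        i += step
    return bytes(out)

filter_4byte_chars('a\U0001F600b'.encode('utf-8'))  # b'ab'
```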
This does more than filter out just the 3-plus-byte UTF-8 characters. It removes non-ASCII characters altogether, but tries to do so gently, replacing them with relevant ASCII counterparts where possible. It can be a blessing later on if your text does not contain, for example, a dozen different Unicode apostrophes and quotation marks (usually coming from Apple handhelds) but only the regular ASCII ones.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust; I use it with some extra guards:
import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible.

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value
    if isinstance(value, str):
        return value
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
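A Python 3 sketch of the same approach (the trailing .decode() is my addition so the caller gets a str back; note that characters without an NFKD decomposition, such as curly quotes and dashes, are simply dropped by the 'ignore' handler rather than replaced):

```python
import unicodedata

def neutralize_unicode(value):
    # Leave falsy or non-str inputs untouched
    if not value or not isinstance(value, str):
        return value
    return (unicodedata.normalize('NFKD', value)
            .encode('ascii', 'ignore')
            .decode('ascii'))

neutralize_unicode('caf\u00e9')  # 'cafe'
```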