How do I get str.translate to work with Unicode strings?
Question:
I have the following code:
import string

def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to *
                                       len(not_letters_or_digits))
    return to_translate.translate(translate_table)
Which works great for non-unicode strings:
>>> translate_non_alphanumerics('<foo>!')
'_foo__'
But fails for unicode strings:
>>> translate_non_alphanumerics(u'<foo>!')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in translate_non_alphanumerics
TypeError: character mapping must return integer, None or unicode
I can’t make any sense of the paragraph on “Unicode objects” in the Python 2.6.2 docs for the str.translate() method.
How do I make this work for Unicode strings?
Answers:
The Unicode version of translate() requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord()) to Unicode ordinals. If you want to delete characters, you map to None.
I changed your function to build a dict mapping the ordinal of every character to the replacement you want to translate to:
def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[]^_`{|}~'
    translate_table = dict((ord(char), translate_to)
                           for char in not_letters_or_digits)
    return to_translate.translate(translate_table)
>>> translate_non_alphanumerics(u'<foo>!')
u'_foo__'
edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord()) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:
>>> translate_non_alphanumerics(u'<foo>!', u'bad')
u'badfoobadbad'
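The same rules carry over to Python 3, where every str is Unicode: a dict keyed by code points works directly with str.translate(), and mapping to None deletes the character. A minimal sketch (Python 3, names chosen here for illustration):

```python
# Python 3: str.translate() takes a dict keyed by code points (ordinals).
# Map to a string to replace, or to None to delete the character.
replace_table = {ord(c): '_' for c in '<>!'}
delete_table = {ord(c): None for c in '<>!'}

print('<foo>!'.translate(replace_table))  # _foo__
print('<foo>!'.translate(delete_table))   # foo
```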
I came up with the following combination of my original function and Mike's version that works with both Unicode and ASCII strings:
import string

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[]^_`{|}~'
    if isinstance(to_translate, unicode):
        translate_table = dict((ord(char), unicode(translate_to))
                               for char in not_letters_or_digits)
    else:
        assert isinstance(to_translate, str)
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to *
                                           len(not_letters_or_digits))
    return to_translate.translate(translate_table)
Update: coerced translate_to to unicode for the unicode translate_table. Thanks Mike.
For a simple hack that will work on both str and unicode objects,
convert the translation table to unicode before running translate():
import string

def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to *
                                       len(not_letters_or_digits))
    translate_table = translate_table.decode("latin-1")
    return to_translate.translate(translate_table)
The catch here is that it will implicitly convert all str objects to unicode, raising a UnicodeDecodeError if to_translate contains non-ASCII characters.
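For comparison, in Python 3 this str/unicode split disappears entirely: str.maketrans() accepts two equal-length strings and builds the ordinal mapping itself. A sketch of the same function under that assumption (Python 3 only):

```python
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = '!"#%\'()*+,-./:;<=>?@[]^_`{|}~'
    # str.maketrans pairs up two equal-length strings into an ordinal dict
    translate_table = str.maketrans(not_letters_or_digits,
                                    translate_to * len(not_letters_or_digits))
    return to_translate.translate(translate_table)

print(translate_non_alphanumerics('<foo>!'))  # _foo__
```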
Instead of having to specify all the characters that need to be replaced, you could also view it the other way around and, instead, specify only the valid characters, like so:
import re

def replace_non_alphanumerics(source, replacement_character='_'):
    result = re.sub("[^_a-zA-Z0-9]", replacement_character, source)
    return result
This works with unicode as well as regular strings, and preserves the type (provided that replacement_character and source are of the same type, obviously).
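If international letters should survive the substitution too, a hedged variant (not in the original answer) widens the pattern to \W, which in Python 3 matches anything except Unicode letters, digits, and underscore:

```python
import re

def replace_non_word_chars(source, replacement_character='_'):
    # \W is the complement of \w, which covers Unicode letters,
    # digits, and underscore in Python 3
    return re.sub(r'\W', replacement_character, source)

print(replace_non_word_chars('héllo, wörld!'))  # héllo__wörld_
```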
This version lets you translate one set of letters to another; here each letter of a Cyrillic word is mapped to the corresponding letter of the reversed word:
def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
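A quick check of the same idea in Python 3 syntax (identical tables, no u prefix needed) shows the word mapped letter-for-letter onto its reverse:

```python
def trans(to_translate):
    tabin = 'привет'
    tabout = 'тевирп'
    # zip ordinals of the source letters with the target letters
    translate_table = dict(zip((ord(char) for char in tabin), tabout))
    return to_translate.translate(translate_table)

print(trans('привет'))  # тевирп
```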
I found that in Python 2.7, with type str, you would write

import string
table = string.maketrans("123", "abc")
print "135".translate(table)

whereas with type unicode you would write

table = {ord(s): unicode(d) for s, d in zip("123", "abc")}
print u"135".translate(table)

In Python 3.6 you would write

table = {ord(s): d for s, d in zip("123", "abc")}
print("135".translate(table))

Maybe this is helpful.
I had a unique problem compared to the others here. First, I knew that my string possibly had Unicode chars in it (thanks to email on Mac…), but one of the common chars was the emdash, u"\u2014", which needed to be converted (back) to two dashes, "--". The other chars that might be found are single-char translations, so they work like the other solutions.
First I created a dict for the emdash. For these I use a simple string replace() to convert them. Other similar chars could be handled here too.
uTranslateDict = {
    u"\u2014": "--",  # Emdash
}
Then I created a list of (from, to) pairs for the 1:1 translations. These go through the translate() builtin.
uTranslateTuple = [(u"\u2010", "-"),   # Hyphen
                   (u"\u2013", "-"),   # Endash
                   (u"\u2018", "'"),   # Left single quote => single quote
                   (u"\u2019", "'"),   # Right single quote => single quote
                   (u"\u201a", "'"),   # Single Low-9 quote => single quote
                   (u"\u201b", "'"),   # Single High-Reversed-9 quote => single quote
                   (u"\u201c", '"'),   # Left double quote => double quote
                   (u"\u201d", '"'),   # Right double quote => double quote
                   (u"\u201e", '"'),   # Double Low-9 quote => double quote
                   (u"\u201f", '"'),   # Double High-Reversed-9 quote => double quote
                   (u"\u2022", "*"),   # Bullet
                   ]
Then the function.
def uTranslate(uToTranslate):
    uTranslateTable = dict((ord(From), unicode(To)) for From, To in uTranslateTuple)
    # Decode once, then apply each multi-character replacement in turn
    uIntermediateStr = uToTranslate.decode("utf-8")
    for c in uTranslateDict.keys():
        uIntermediateStr = uIntermediateStr.replace(c, uTranslateDict[c])
    return uIntermediateStr.translate(uTranslateTable)
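In Python 3 the decode step drops out (the input is already str); a sketch of the same two-stage approach under that assumption, with a shortened pair list and names chosen here for illustration:

```python
# Multi-character replacements go through str.replace();
# one-to-one replacements go through str.translate().
translate_dict = {"\u2014": "--"}  # emdash -> two dashes
translate_pairs = [("\u2018", "'"), ("\u2019", "'"),
                   ("\u201c", '"'), ("\u201d", '"')]

def u_translate(s):
    table = {ord(frm): to for frm, to in translate_pairs}
    for char, repl in translate_dict.items():
        s = s.replace(char, repl)
    return s.translate(table)

print(u_translate("\u201chi\u201d \u2014 ok"))  # "hi" -- ok
```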
Since I know the format of the input string I didn’t have to worry about two types of input strings.