How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

Question:

For example, if I have a unicode string, I can encode it as an ASCII string like so:

>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'

However, I have e.g. this ASCII string:

'\u003cfoo/\u003e'

… that I want to turn into the same ASCII string as in my first example above:

'<foo/>'
Asked By: John


Answers:

It’s a little dangerous depending on where the string is coming from, but how about:

>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Answered By: Ned Batchelder

It took me a while to figure this one out, but this page had the best answer:

>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'

There’s also a ‘raw-unicode-escape’ codec to handle the other way to specify Unicode strings. Check the “Unicode Constructors” section of the linked page for more details (since I’m not that Unicode-savvy).
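As a rough illustration of how the two codecs differ, here is a small sketch. The byte string `raw` is made up for the example, and it is written as a bytes literal so that it should also run on Python 3, where `str` no longer has a `decode` method.

```python
# Hypothetical input: literal backslash-u escapes followed by a literal backslash-n.
raw = b'\\u003cfoo/\\u003e\\n'

# unicode_escape interprets all Python string escapes, including \n.
full = raw.decode('unicode_escape')

# raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX escapes,
# so the backslash-n pair is left as two characters.
partial = raw.decode('raw_unicode_escape')

print(repr(full))
print(repr(partial))
```

Which codec fits depends on whether other escapes such as `\n` in the input should be interpreted as well.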

EDIT: See also Python Standard Encodings.

Answered By: hark

On Python 2.5 the correct encoding is “unicode_escape”, not “unicode-escape” (note the underscore).

I’m not sure whether newer versions of Python changed the codec name, but here it only worked with the underscore.
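One way to check which spellings a given interpreter accepts is to ask the codec registry directly. This is only a sketch: `codecs.lookup` raises `LookupError` for a name it does not know, and the variable names are just for illustration.

```python
import codecs

# Look up both spellings; an unknown name would raise LookupError here.
hyphen = codecs.lookup('unicode-escape')
underscore = codecs.lookup('unicode_escape')

# If both resolve, they should report the same canonical codec name.
same_codec = hyphen.name == underscore.name
print(hyphen.name, underscore.name, same_codec)
```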

Anyway, this is it.

Answered By: Kaniabi

Ned Batchelder said:

It’s a little dangerous depending on where the string is coming from,
but how about:

>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'

Actually this method can be made safe like so:

>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]

Mind the triple-quote string and the dash right before the closing 3-quotes.

  1. Using a 3-quoted string will ensure that if the user enters ‘ \” ‘ (spaces added for visual clarity) in the string it would not disrupt the evaluator;
  2. The dash at the end is a failsafe in case the user’s string ends with a ‘ ” ‘. Before we assign the result we slice off the inserted dash with [:-1].

So there would be no need to worry about what the users enter, as long as it is captured in raw format.
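A variation on the same idea, sketched here as an assumption rather than as part of the method above, is to swap `eval` for `ast.literal_eval`, which only accepts Python literals and refuses arbitrary expressions. The helper name `unescape_with_literal_eval` is hypothetical.

```python
import ast

def unescape_with_literal_eval(s):
    # Hypothetical helper: same triple-quote and trailing-dash trick as above,
    # but parsed with ast.literal_eval instead of eval.
    quoted = u'u"""' + s.replace(u'"', u'\\"') + u'-"""'
    # Drop the dash that was appended before the closing quotes.
    return ast.literal_eval(quoted)[:-1]

result = unescape_with_literal_eval(u'\\u003cfoo\\u003e')
print(result)
```

If the input somehow breaks out of the string literal, `ast.literal_eval` should raise an error instead of running code.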

Answered By: MakerDrone

At some point you will run into issues when the string you want to decode contains special characters such as Chinese characters or emoticons. The errors look like this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)

In my case (Twitter data processing), I decoded as follows, which let me see all characters with no errors:

>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
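To make the difference concrete, here is a small sketch with input that actually contains non-ASCII characters. The sample string is made up for the example, and it is a bytes literal so the same code should also run on Python 3.

```python
# Hypothetical input: two escaped Chinese characters (U+4E2D, U+6587).
s = b'\\u4e2d\\u6587'
text = s.decode('unicode_escape')

# Encoding to ASCII fails because the characters are outside range(128).
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = text.encode('utf-8')
print(ascii_ok, repr(utf8_bytes))
```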
Answered By: Okezie