How can I compare a unicode type to a string in Python?

Question

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two – how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I’m using Google App Engine, which uses Python 2.7

Here’s a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

I need something like:
myList = [item for item in k if item=="number1"]  

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]

Asked By: rGil

Answer 1

I’m assuming you’re using Python 3. us.encode('utf-8') == "MyString" returns False because str.encode() returns a bytes object, and in Python 3 a bytes object never compares equal to a str:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
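
A minimal sketch of that behavior, assuming a Python 3 interpreter (the variable names other than us are only illustrative):

```python
# Assumes Python 3, where str is Unicode text and bytes is a separate type.
us = u'MyString'                 # the u prefix is accepted but has no effect

encoded = us.encode('utf-8')     # bytes: b'MyString'

bytes_vs_text = (encoded == 'MyString')                 # bytes compared with str
text_vs_text = (encoded.decode('utf-8') == 'MyString')  # decoded back to str first
direct = (us == 'MyString')                             # no encoding involved

print(type(encoded).__name__, bytes_vs_text, text_vs_text, direct)
# prints: bytes False True True
```

Comparing the text directly, without encoding, is the simplest option here.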

Answered By: MattDMo

Answer 2

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
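
To make the difference between unicode data and UTF-8 bytes concrete, here is a small sketch. The sample word is chosen only for illustration; the snippet should behave the same on Python 2.7 and Python 3:

```python
text = u'caf\u00e9'                 # Unicode text: 4 code points
utf8_bytes = text.encode('utf-8')   # UTF-8 bytes: the last character takes 2 bytes

print(len(text))        # 4
print(len(utf8_bytes))  # 5

# Decoding the bytes with the same codec recovers the original text.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)  # True
```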

On Python 2, your expectation that the test returns True is correct, so you must be doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]

Answered By: Martijn Pieters

Answer 3

You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an “apples and oranges” comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It’s up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
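
A rough sketch of that asymmetry, using made-up sample data: encoding ordinary text to UTF-8 succeeds, while decoding arbitrary bytes as UTF-8 can raise an error.

```python
text = u'Em dashes\u2014are cool'
encoded = text.encode('utf-8')      # succeeds; the em dash becomes 3 bytes

raw = b'Em dashes\x97are cool'      # bytes that are not valid UTF-8
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # 0x97 cannot start a UTF-8 sequence, so decoding fails here
    decoded_ok = False

print(decoded_ok)  # False
```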

But if you choose to do a UTF-8 encode of the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. But if you .encode('Windows-1252') instead it would succeed. That’s why it’s an apples and oranges comparison.
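
The em dash case can be checked directly. This is only a sketch: the variable names are illustrative, and the byte string is written with a b prefix so the snippet runs on both Python 2.7 and Python 3.

```python
text = u'Em dashes\u2014are cool'
windows_bytes = b'Em dashes\x97are cool'   # em dash stored as the single byte 0x97

# The same text yields different bytes under different encodings.
utf8_match = (text.encode('utf-8') == windows_bytes)
cp1252_match = (text.encode('windows-1252') == windows_bytes)

# Decoding the bytes with the matching codec also gives equal text.
decoded_match = (windows_bytes.decode('windows-1252') == text)

print(utf8_match, cp1252_match, decoded_match)
# prints: False True True
```

Whether two values count as equal therefore depends on knowing which encoding produced the bytes.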

Answered By: wberry
