How can I compare a unicode type to a string in python?

Question:

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two – how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I’m using Google App Engine, which uses Python 2.7

Here’s a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

I need something like:
myList = [item for item in k if item=="number1"]  

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
Asked By: rGil


Answers:

I’m assuming you’re using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() method returns a bytes object, and a bytes object never compares equal to a str:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u'MyString' is superfluous.
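
As a minimal sketch of the Python 3 behaviour described above (this answer assumes Python 3; the variable names other than us are only illustrative), the comparison succeeds once both sides are the same type:

```python
# Python 3 semantics: str is Unicode text, and encode() produces bytes.
us = u'MyString'                  # the u prefix is accepted but redundant here
encoded = us.encode('utf-8')

is_bytes = isinstance(encoded, bytes)
bytes_vs_text = (encoded == "MyString")    # bytes vs. str: not equal
text_vs_text = (us == "MyString")          # str vs. str
bytes_vs_bytes = (encoded == b"MyString")  # bytes vs. bytes

print(is_bytes, bytes_vs_text, text_vs_text, bytes_vs_bytes)
```

So on Python 3 the simplest fix is to skip the encode() call and compare the text directly.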

Answered By: MattDMo

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]  

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]  

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference.
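
To make the distinction concrete, here is a small sketch. The JSON document and the key "café" are made up for illustration; a non-ASCII character is used because it makes the decoded text and its UTF-8 bytes differ in length:

```python
import json

# Hypothetical payload: the key contains U+00E9, written as a JSON escape.
data = json.loads('{"caf\\u00e9": "first"}')
key = list(data)[0]               # decoded text, 4 code points

utf8_bytes = key.encode('utf-8')  # encoded form, 5 bytes (U+00E9 takes two)

text_length = len(key)
byte_length = len(utf8_bytes)
round_trip = (utf8_bytes.decode('utf-8') == key)

print(text_length, byte_length, round_trip)
```

The value that json.loads hands back is the decoded text; the UTF-8 bytes only exist once you encode it.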

On Python 2, your expectation that the test returns True is correct, so you must be doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]
Answered By: Martijn Pieters

You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an “apples and oranges” comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It’s up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
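
A rough illustration of that asymmetry, using made-up sample data (Latin-1 bytes for "café"): encoding the text to UTF-8 works, while decoding the arbitrary bytes as UTF-8 raises an error.

```python
text = u'caf\xe9'
latin1_bytes = b'caf\xe9'   # assumed to come from a Latin-1 source

# Encoding ordinary Unicode text to UTF-8 succeeds.
encoded = text.encode('utf-8')

# Decoding bytes that are not valid UTF-8 fails.
try:
    latin1_bytes.decode('utf-8')
    decode_succeeded = True
except UnicodeDecodeError:
    decode_succeeded = False

print(repr(encoded), decode_succeeded)
```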

But if you choose to UTF-8 encode the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you .encode('Windows-1252') instead, the comparison succeeds. That’s why it’s an apples and oranges comparison.
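
The em dash case can be checked directly. This sketch writes the byte string with a b prefix so it behaves the same on Python 2.7 and Python 3, and uses cp1252, Python's canonical name for the Windows-1252 codec; the variable names are only illustrative:

```python
text = u'Em dashes\u2014are cool'
windows_bytes = b'Em dashes\x97are cool'   # em dash is byte 0x97 in Windows-1252

# Same text, two encodings: only the matching encoding compares equal.
matches_as_utf8 = (text.encode('utf-8') == windows_bytes)
matches_as_cp1252 = (text.encode('cp1252') == windows_bytes)

# Going the other way also works, provided the right codec is known.
decoded_matches = (windows_bytes.decode('cp1252') == text)

print(matches_as_utf8, matches_as_cp1252, decoded_matches)
```

Either direction works only when you know which encoding produced the bytes, which is the point of the apples and oranges remark.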

Answered By: wberry