How can I compare a unicode type to a string in python?
Question:
I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question: why does this return False?
us.encode('utf-8') == "MyString" ## False
Part two – how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj
if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.
EDIT: I’m using Google App Engine, which uses Python 2.7
Here’s a more complete example of the problem:
#json coming from remote server:
#response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
I need something like:
myList = [item for item in k if item=="number1"]
#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
Answers:
I’m assuming you’re using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function is returning a bytes object:
In [2]: us.encode('utf-8')
Out[2]: b'MyString'
In Python 3, strings are already Unicode, so the u'MyString' is superfluous.
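The Python 3 behaviour described above can be sketched as follows. This is only an illustration, assuming a Python 3 interpreter; the variable names are made up for the example.

```python
# Python 3 only: str is Unicode text, bytes is encoded data.
us = u'MyString'               # the u prefix is accepted but redundant in Python 3

encoded = us.encode('utf-8')   # str.encode() returns a bytes object
print(type(encoded).__name__)  # bytes

# bytes and str never compare equal in Python 3, even with identical ASCII content.
bytes_vs_text = (encoded == "MyString")
# Comparing bytes with bytes, or text with text, behaves as expected.
bytes_vs_bytes = (encoded == b"MyString")
text_vs_text = (us == "MyString")

print(bytes_vs_text, bytes_vs_bytes, text_vs_text)  # False True True
```

On Python 2.7 the same comparison behaves differently, because str there is the byte-string type.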
You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, us is not a UTF-8 string; it is Unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Pragmatic Unicode by Ned Batchelder
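To make the distinction between Unicode text and UTF-8 bytes concrete, here is a small sketch. The sample word is an assumption chosen for illustration; the code should behave the same on Python 2.7 and Python 3.

```python
# Unicode text is a sequence of code points; UTF-8 is one way to store it as bytes.
text = u'caf\u00e9'                # four code points, the last one is e-acute
utf8_bytes = text.encode('utf-8')  # the e-acute needs two bytes in UTF-8

text_length = len(text)            # counts code points
byte_length = len(utf8_bytes)      # counts bytes
round_trip = utf8_bytes.decode('utf-8')

print("%d %d %s" % (text_length, byte_length, round_trip == text))  # 4 5 True
```

The lengths differ because one object counts characters and the other counts encoded bytes, which is why the two types should not be treated as interchangeable.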
On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
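If the value to compare against sometimes arrives as a byte string, one option is to decode it once and compare text with text. The sketch below assumes the bytes are UTF-8; as_text is a hypothetical helper, not part of the json library, and the payload is a stand-in for the real server response.

```python
import json

def as_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it first if it is a byte string.

    Hypothetical helper for illustration; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Stand-in for the remote server's response.
response = '{"number1": "first", "number2": "second"}'
data = json.loads(response)        # keys and values come back as Unicode text

wanted = as_text(b"number1")       # normalise the search term once, up front
matches = [key for key in data if key == wanted]

print(len(matches))  # 1
```

Normalising at the boundary keeps the comprehension itself free of encode or decode calls.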
You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an “apples and oranges” comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:
>>> u'MyString' == 'MyString' # in my opinion should be False
True
It’s up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:
a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b # True
I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
But if you choose to do a UTF-8 encode of the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. But if you .encode('Windows-1252') instead it would succeed. That’s why it’s an apples and oranges comparison.
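The em-dash case above can be checked with a short sketch. It assumes the byte string was produced with Windows-1252 (Python's cp1252 codec); the variable names are illustrative, and the code should run on both Python 2.7 and Python 3.

```python
text = u'Em dashes\u2014are cool'
cp1252_bytes = b'Em dashes\x97are cool'   # Windows-1252 stores the em dash as byte 0x97

# UTF-8 encodes U+2014 as three bytes, so the byte sequences differ.
same_as_utf8 = (text.encode('utf-8') == cp1252_bytes)
# Encoding with the codec that produced the bytes gives a match.
same_as_cp1252 = (text.encode('cp1252') == cp1252_bytes)

# Going the other way, these bytes are not valid UTF-8 at all.
try:
    cp1252_bytes.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print("%s %s %s" % (same_as_utf8, same_as_cp1252, decodes_as_utf8))  # False True False
```

Either direction works only when the codec matches the one that produced the bytes, so the comparison is meaningful only once the encoding is known.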