Python – Issues with Unicode String from API Call

Question:

I’m using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.

When I call the endpoint, the name prints out with the unicode attached to it:

>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))

>>> print(last_name)

"Mitrovi\u0107"

>>> print(type(last_name))

<class 'str'>

If I copy and paste that output and put it in a string literal of its own like so:

>>> print("Mitrovi\u0107")

Mitrović

>>> print(type("Mitrovi\u0107"))

<class 'str'>

Then it prints just fine?

What is wrong with the API endpoint call and the string that comes from it?

Asked By: SquareHammer89

Answers:

Count the characters in your string and I’ll bet you’ll find that the result of json.dumps() is 13 characters (not counting the surrounding double quotes):

"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\u0107"

When you paste "Mitrovi\u0107" into a Python string literal, you get 8 characters, because Python reads ‘\u0107’ as a single Unicode character.

That would suggest the endpoint is not sending properly JSON-escaped Unicode, or that somewhere in your code you’re reading it as ASCII first. Look carefully at exactly what you’re receiving.
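
To check the counts yourself, here is a minimal sketch. The variable names are made up for illustration, and the name is hard-coded rather than fetched from the API in the question.

```python
import json

# Python string literal: \u0107 is read as one character (ć).
name = "Mitrovi\u0107"

# JSON text: double quotes are added and ć is escaped as six ASCII characters.
dumped = json.dumps(name)

print(len(name))    # 8
print(len(dumped))  # 15 (13 characters plus the two double quotes)
print(dumped)       # "Mitrovi\u0107"
```

The two strings have different lengths because one is the name itself and the other is its JSON representation.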

Answered By: pbuck

Well, you serialise the string with json.dumps() before printing it; that’s why you get a different output.
Compare the following:

>>> print("Mitrović")
Mitrović

and

>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"

The second command adds double quotes to the output and escapes non-ASCII chars, because that’s how strings are encoded in JSON. So it’s possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
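
A rough sketch of that point, assuming a made-up response body (the real API's structure is deeper, and `raw_body` is purely hypothetical): once the JSON has been parsed, the value is already an ordinary Python string, and json.dumps() only turns it back into JSON text.

```python
import json

# Hypothetical response body, shaped loosely like the one in the question.
raw_body = '{"player": {"lastname": "Mitrovi\\u0107"}}'

# Parsing decodes the \u0107 escape back into the character ć.
response = json.loads(raw_body)
last_name = response["player"]["lastname"]

print(last_name)              # Mitrović
print(json.dumps(last_name))  # "Mitrovi\u0107"

# Serialising and parsing again gives back the same string.
print(json.loads(json.dumps(last_name)) == last_name)  # True
```

So if the goal is just to display the name, printing the parsed value directly should be enough.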

Note: don’t confuse Python string literals and JSON serialisation of strings. They share some common features, but they aren’t the same (e.g. JSON strings can’t be single-quoted), and they serve different purposes (the first is for writing strings in source code, the second is for encoding data to send across the network).
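
The single-quote difference can be seen with a small check. `is_valid_json` is a helper invented for this example, not part of the json module.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON (helper made up for this example)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Double-quoted: a valid JSON string.
print(is_valid_json('"Mitrovi\\u0107"'))  # True

# Single-quoted: fine as a Python literal, but not valid JSON.
print(is_valid_json("'Mitrovi\\u0107'"))  # False
```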

Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:

>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"
Answered By: lenz