How to make Python treat a literal string as a UTF-8 encoded string
Question:
I have some strings in Python loaded from a file. They look like lists, but are actually strings, for example:
example_string = '["hello", "there", "w\\u00e5rld"]'
I can easily convert it into an actual list of strings:
from typing import List

def string_to_list(string_list: str) -> List[str]:
    # Strip quotes and brackets, then split on commas
    converted = string_list.replace('"', '').replace('[', '').replace(']', '').split(',')
    return [s.strip() for s in converted]
as_list = string_to_list(example_string)
print(as_list)
Which returns the following list of strings: ['hello', 'there', 'w\\u00e5rld']
The problem is the encoding of the last element of the list. It looks like this when I run print(as_list), but if I run
for element in as_list:
    print(element)
it returns
hello
there
w\u00e5rld
I don't know what happens to the first backslash; it seems to me like it is there to escape the second one in the encoding. How do I make Python just resolve the UTF-8 character and print "wårld"? The problem is that it is a string, not an encoding, so as_list[2].decode("UTF-8") does not work.
I tried using string.decode(), and I tried plain printing.
Answers:
The correct way to decode that to a list of strings is not the insane set of string operations you’re performing. It’s just ast.literal_eval(example_string), which will handle Unicode escapes just fine:
import ast
example_string = '["hello", "there", "w\\u00e5rld"]'
example_list = ast.literal_eval(example_string)
for word in example_list:
    print(word)
which, assuming you have appropriate font support for the character, outputs:
hello
there
wårld
If you absolutely needed to just fix Unicode escapes, the codecs module can be used for unicode_escape decoding, but in this case, you have a legal Python literal in a string, and ast.literal_eval can do all the work.
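For the escapes-only case, a minimal sketch might look like the following. The helper name decode_escapes is hypothetical, and the sketch assumes the input holds only ASCII characters plus backslash escapes; text that already contains non-ASCII characters can come out mangled with unicode_escape.

```python
import codecs


def decode_escapes(text: str) -> str:
    """Resolve backslash escapes such as \\u00e5 in an otherwise ASCII string.

    Hypothetical helper for illustration; assumes ASCII-only input.
    """
    return codecs.decode(text, "unicode_escape")


# The backslash here is a real character in the string, as it would be
# after reading the text from a file.
raw = "w\\u00e5rld"
print(repr(raw))            # repr shows the backslash doubled
print(raw)                  # print shows the single literal backslash
print(decode_escapes(raw))  # the escape is resolved to the actual character
```

This also shows where the "missing" backslash goes: the list display uses repr, which doubles the backslash, while print shows the string as it really is.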