How to make Python treat literal string as UTF-8 encoded string

Question:

I have some strings in Python loaded from a file. They look like lists, but are actually strings, for example:

example_string = '["hello", "there", "w\\u00e5rld"]'

I can easily convert it into an actual list of strings:

from typing import List

def string_to_list(string_list: str) -> List[str]:
    # Strip the quotes and brackets, then split on commas.
    converted = string_list.replace('"', '').replace('[', '').replace(']', '').split(',')
    return [s.strip() for s in converted]

as_list = string_to_list(example_string)
print(as_list)

Which prints the following list of strings: ['hello', 'there', 'w\\u00e5rld']
The problem is the encoding of the last element of the list. It looks like this when I run print(as_list), but if I run

for element in as_list:
    print(element)

it prints

hello
there
w\u00e5rld

I don't know what happens to the first backslash; it seems to me like it is there to escape the second one in the encoding. How do I make Python just resolve the UTF-8 character and print "wårld"? The problem is that it is already a string, not encoded bytes, so as_list[2].decode("UTF-8") does not work.

I tried using string.decode(), and I tried plain printing.

Asked By: Stine Nyhus


Answers:

The correct way to decode that to a list of strings is not the insane set of string operations you’re performing. It’s just ast.literal_eval(example_string), which will handle Unicode escapes just fine:

    import ast
    
    example_string = '["hello", "there", "w\\u00e5rld"]'
    example_list = ast.literal_eval(example_string)
    for word in example_list:
        print(word)

which, assuming you have appropriate font support for the character, outputs:

hello
there
wårld

If you absolutely needed to just fix Unicode escapes, the codecs module can be used for unicode_escape decoding, but in this case, you have a legal Python literal in a string, and ast.literal_eval can do all the work.
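
For the escape-only case mentioned above, a minimal sketch might look like the following. The helper name decode_unicode_escapes is hypothetical, and the sketch assumes the input is plain ASCII text containing literal backslash escapes; the unicode_escape codec can garble text that already contains non-ASCII characters.

```python
import codecs


def decode_unicode_escapes(text: str) -> str:
    # Hypothetical helper: interpret literal backslash escapes such as
    # \u00e5 inside an otherwise plain (assumed ASCII) string.
    return codecs.decode(text, "unicode_escape")


# The doubled backslash puts a literal backslash followed by u00e5
# into the string, as it would appear when read from a file.
raw = "w\\u00e5rld"
fixed = decode_unicode_escapes(raw)
print(fixed)
```

This only rewrites escapes inside a single string. It does not parse the surrounding list syntax, which is why ast.literal_eval remains the better fit for the question's input.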

Answered By: ShadowRanger