Selective replacement of unicode characters in Python using regex

Question:

There are many answers as to how one can use regex to remove unicode characters in Python.

See Remove Unicode code (\uxxx) in string Python and Python regex module "re" match unicode characters with \u

However, in my case, I don’t want to replace every Unicode character, only the ones that are displayed as their \u escape code, not the ones that are properly shown as characters. I have tried both solutions and they remove both types of Unicode characters:

\u2002pandemic becomes pandemic
and master’s becomes masters

Is there a general solution for removing the first type of Unicode character while keeping the second kind?

Asked By: user1627466


Answers:

This uses the idea that the debug representation (repr()) of a text will show escape codes for non-printable characters, so it removes those escape codes (three types: \xnn, \unnnn, \Unnnnnnnn) and evaluates the result:

import re
import ast

text = '\x19\x40\u2002\u2019\U0001e526\U0001f235\\u1234\\U00012345\\xff\\\u2002'
#       ^^^^    ^^^^^^      ^^^^^^^^^^                                   ^^^^^^
# To remove above, others are printable escape codes or literal backslashes.
# If preceded by an odd number of backslashes, it's an escape code.
print('printed text:   ', text)
print('repr() text:    ', repr(text))
clean_text = ast.literal_eval(re.sub(r'''(?x)                # verbose mode
                                         (?<!\\)            # not preceded by literal backslash
                                         ((?:\\\\)*)       # zero or more pairs literal backslashes (group 1)
                                         \\                 # match a literal backslash
                                         (?:                 # non-capturing group
                                         (?:x[0-9a-f]{2}) |  # match an x and 2 hexadecimal digits OR
                                         (?:u[0-9a-f]{4}) |  # match a u and 4 hex digits OR
                                         (?:U[0-9a-f]{8})    # match a U and 8 hex digits
                                         )                   # end non-capturing group
                                         ''',
                                         r'\1'              # replace with group 1 (pairs of backslashes, if any)
                                         , repr(text)))      # string to operate on
print('cleaned text:   ', clean_text)
print('cleaned repr(): ', repr(clean_text))

Output:

printed text:    @ ’🈵\u1234\U00012345\xff\ 
repr() text:     '\x19@\u2002’\U0001e526🈵\\u1234\\U00012345\\xff\\\u2002'
cleaned text:    @’🈵\u1234\U00012345\xff\
cleaned repr():  '@’🈵\\u1234\\U00012345\\xff\\'

Note that you may not want to remove every character that displays as an escape code. The difference between str() (print display) and repr() (debug display) can be intentional. For example, \u2002 is an EN SPACE (another type of SPACE character) and prints as a space. The debug representation shows it as an escape code only so you can tell the difference between an ASCII SPACE and an EN SPACE.
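If you would rather keep such spaces than delete them, one possible variation is sketched below. It is only an illustration, not part of the approach above: the function name clean_keep_spaces is made up, and it assumes that turning every Unicode space separator (category Zs) into an ASCII space is acceptable for your data.

```python
import unicodedata

def clean_keep_spaces(text: str) -> str:
    """Hypothetical helper: drop non-printable characters, but turn
    Unicode space separators (category Zs, e.g. EN SPACE) into ' '."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == 'Zs':
            out.append(' ')          # EN SPACE, NO-BREAK SPACE, ASCII space, ...
        elif ch.isprintable():
            out.append(ch)           # letters, punctuation such as ’, emoji, ...
        # everything else (control and format characters) is dropped;
        # note that this includes '\n' and '\t'
    return ''.join(out)

sample = '\u2002pandemic and master\u2019s\x19'
print(repr(clean_keep_spaces(sample)))
# ' pandemic and master’s'
```

If newlines or tabs matter in your text, they would need their own branch, since they are control characters too.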

Answered By: Mark Tolonen

There’s str.isprintable() exactly for this type of thing:

src = 'a \u200d \x1f Ü ß ы'

cleaned = ''.join(c for c in src if c.isprintable())

print(repr(src))
print(repr(cleaned))

# 'a \u200d \x1f Ü ß ы'
# 'a   Ü ß ы'
Answered By: gog
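
The two answers rest on the same rule: for a single character, str.isprintable() returns False in the same cases where repr() shows an escape code, with the backslash as the one exception. A small check of that relationship, as a sketch (the helper escaped_by_repr is a made-up name for illustration):

```python
def escaped_by_repr(ch: str) -> bool:
    """Hypothetical helper: True if repr() shows ch as an escape sequence."""
    # repr() of a one-character string is that character in quotes,
    # unless the character had to be escaped.
    return repr(ch)[1:-1] != ch

samples = ['a', ' ', '\u2002', '\u200d', '\x1f', 'Ü', '\u2019']
for ch in samples:
    print(f'{ch!r:10} isprintable={ch.isprintable()!s:5} escaped={escaped_by_repr(ch)}')

# The backslash is printable but is still escaped by repr(),
# which is why the regex answer treats pairs of backslashes specially.
print('\\'.isprintable(), escaped_by_repr('\\'))
```

Note that this means isprintable() also drops the EN SPACE (\u2002), since it is a separator other than the ASCII space.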