How to detect if a String has specific UTF-8 characters in it? (Python)

Question:

I have a list of strings in Python. I want to remove every string that contains characters outside the range “U+0021” to “U+00FF”, so that only the strings made up entirely of characters in that range remain. Do you know a way to detect whether a string contains only these characters?

Thanks 🙂

EDIT: I use Python 3

Asked By: tommitomtom


Answers:

You can use a regular expression.

import re
mylist = ['str1', 'štr2', 'str3']
regexp = re.compile(r'[^\u0021-\u00FF]')
good_strs = filter(lambda s: not regexp.search(s), mylist)

[^\u0021-\u00FF] defines a character set that matches any one character not in the range from \u0021 to \u00FF. The letter r before '[^\u0021-\u00FF]' indicates raw string notation, which saves you a lot of backslash (\) escaping. Without it, every backslash in a regular expression would have to be prefixed with another one to escape it. Here the re module itself interprets the \uXXXX escapes.

regexp.search(s) will scan through s looking for the first location where the regular expression r'[^\u0021-\u00FF]' produces a match, and return a corresponding match object. It returns None if no match is found.

filter() will filter out the unwanted strings. In Python 3 it returns an iterator, so wrap it in list() if you need a list.
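
As a minimal, self-contained sketch of the regular-expression approach (the names OUT_OF_RANGE and keep_in_range are made up for illustration and are not part of the answer), something like this should work in Python 3:

```python
import re

# Matches any single character outside U+0021..U+00FF.
# The re module interprets \uXXXX itself, so a raw string works here.
OUT_OF_RANGE = re.compile(r'[^\u0021-\u00FF]')

def keep_in_range(strings):
    """Return the strings made up only of characters U+0021..U+00FF."""
    return [s for s in strings if not OUT_OF_RANGE.search(s)]

sample = ['str1', 'štr2', 'str3', 'two words']
print(keep_in_range(sample))  # ['str1', 'str3']
```

Note that 'two words' is rejected as well, because the space character is U+0020 and falls just below the requested range.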

This answer is only valid for Python 3

Answered By: ltux

What do you mean exactly by “special utf-8 characters” ?

If you mean every non-ascii character, then you can try:

s.encode('ascii', 'strict')

It will raise a UnicodeEncodeError if the string is not 100% ASCII.
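
A small sketch of how that check could be wrapped in a helper (is_ascii is a hypothetical name). In Python 3, a str that cannot be encoded raises UnicodeEncodeError. Keep in mind that ASCII only covers U+0000 to U+007F, which is narrower than the range asked about in the question:

```python
def is_ascii(s):
    """True if every character of s is in the ASCII range U+0000..U+007F."""
    try:
        s.encode('ascii')  # 'strict' is the default error handler
    except UnicodeEncodeError:
        return False
    return True

print(is_ascii('str1'))  # True
print(is_ascii('štr2'))  # False
```

On Python 3.7 and later, the built-in s.isascii() gives the same answer without the try/except.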

Answered By: Blablablabli

>>> all_strings = ["okstring", "bađštring", "goodstring"]
>>> acceptible = set(chr(i) for i in range(0x21, 0xFF + 1))
>>> simple_strings = filter(lambda s: set(s).issubset(acceptible), all_strings)
>>> list(simple_strings)
['okstring', 'goodstring']

Answered By: frnhr

The latin1 encoding corresponds to the first 256 Unicode code points. Put differently, if c is a Unicode character with a code in [0-255], c.encode('latin1') is a single byte whose value equals ord(c).
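
This one-to-one mapping can be checked directly. The loop below is only an illustrative sanity check, not something the approach itself requires:

```python
# Every code point 0..255 encodes to exactly one Latin-1 byte of the same value.
for code in range(256):
    encoded = chr(code).encode('latin1')
    assert len(encoded) == 1 and encoded[0] == code

print('é'.encode('latin1'))  # b'\xe9'
```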

So to test whether a string has at least one character outside the [0-255] range, just try to encode it as latin1. If it contains none, the encoding will succeed; otherwise you will get a UnicodeEncodeError:

no_special = True
try:
    s.encode('latin1')
except UnicodeEncodeError:
    no_special = False

BTW, as you were told in a comment, Unicode characters outside the [0-255] range are not special; they are simply not in the latin1 range.

Please note that the above also accepts all control characters like \t, \r or \n because they are legal latin1 characters. It may or may not be what you want here.
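
If the control characters and the space should be rejected too, one possible sketch is to compare code points against the U+0021 to U+00FF range from the question instead of encoding. The function name and its low/high parameters are assumptions made for illustration:

```python
def only_in_range(s, low=0x21, high=0xFF):
    """True if every character of s has a code point in [low, high]."""
    return all(low <= ord(c) <= high for c in s)

print(only_in_range('okstring'))    # True
print(only_in_range('café'))        # True
print(only_in_range('tab\there'))   # False, \t is U+0009
print(only_in_range('bađštring'))   # False, đ is U+0111
```

This range still includes DEL (U+007F) and the C1 control characters (U+0080 to U+009F), so tighten the bounds if those should be excluded as well.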

Answered By: Serge Ballesta

The code snippet below worked for me (using regex in Python 3). It strips the characters U+00A1 to U+00FF from a string:

import re
# Build a character class from U+00A1..U+00FF and remove those characters
nonAcceptibleUTF8Chars = [chr(i) for i in range(161, 255 + 1)]
result = re.sub('[' + re.escape(''.join(nonAcceptibleUTF8Chars)) + ']', '', inputString)

For example:

inputString = 'VICTORIAÏ¿½S SECRET'

result = 'VICTORIAS SECRET'

Though late to the party, hope this helps! 🙂

Answered By: VinjaNinja