Python regex matching Unicode properties

Question:

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use p{Ll} to match an arbitrary lower-case letter, or p{Zs} for any space separator. I don’t see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

Asked By: ThomasH

||

Answers:

You’re right that Unicode property classes are not supported by the Python regex parser.

If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, p{M} would become [u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F], and P{M} would become [^u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F].

People would thank you. 🙂

Answered By: Jonathan Feinberg

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say p{Armenian} to match Armenian characters. p{Ll} or p{Zs} work too.

Answered By: joeforker

Note that while p{Ll} has no equivalent in Python regular expressions, p{Zs} should be covered by '(?u)s'.
The (?u), as the docs say, “Make w, W, b, B, d, D, s and S dependent on the Unicode character properties database.” and s means any spacing character.

Answered By: tzot

You can painstakingly use unicodedata on each character:

import unicodedata

def strip_accents(x):
    return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
Answered By: zellyn

The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the p{} syntax.

Answered By: ronnix

Speaking of homegrown solutions, some time ago I wrote a small program to do just that – convert a unicode category written as p{...} into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L, Zs), and is restricted to the BMP. I’m posting it here in case someone find it useful (although that Oniguruma really seems a better option).

Example usage:

>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>

Here’s the source. There is also a JavaScript version, using the same data.

Answered By: mgibsonbr