Python regex matching Unicode properties
Question:
Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don’t see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
Answers:
You’re right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (p{M}
or whatever) and replaces them with the corresponding character sets, so that, for example, p{M}
would become [u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F]
, and P{M}
would become [^u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F]
.
People would thank you. 🙂
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say p{Armenian}
to match Armenian characters. p{Ll}
or p{Zs}
work too.
Note that while p{Ll}
has no equivalent in Python regular expressions, p{Zs}
should be covered by '(?u)s'
.
The (?u)
, as the docs say, “Make w, W, b, B, d, D, s and S dependent on the Unicode character properties database.” and s
means any spacing character.
You can painstakingly use unicodedata on each character:
import unicodedata
def strip_accents(x):
return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
The regex module (an alternative to the standard re
module) supports Unicode codepoint properties with the p{}
syntax.
Speaking of homegrown solutions, some time ago I wrote a small program to do just that – convert a unicode category written as p{...}
into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L
, Zs
), and is restricted to the BMP. I’m posting it here in case someone find it useful (although that Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here’s the source. There is also a JavaScript version, using the same data.
Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don’t see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
You’re right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (p{M}
or whatever) and replaces them with the corresponding character sets, so that, for example, p{M}
would become [u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F]
, and P{M}
would become [^u0300–u036Fu1DC0–u1DFFu20D0–u20FFuFE20–uFE2F]
.
People would thank you. 🙂
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say p{Armenian}
to match Armenian characters. p{Ll}
or p{Zs}
work too.
Note that while p{Ll}
has no equivalent in Python regular expressions, p{Zs}
should be covered by '(?u)s'
.
The (?u)
, as the docs say, “Make w, W, b, B, d, D, s and S dependent on the Unicode character properties database.” and s
means any spacing character.
You can painstakingly use unicodedata on each character:
import unicodedata
def strip_accents(x):
return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
The regex module (an alternative to the standard re
module) supports Unicode codepoint properties with the p{}
syntax.
Speaking of homegrown solutions, some time ago I wrote a small program to do just that – convert a unicode category written as p{...}
into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L
, Zs
), and is restricted to the BMP. I’m posting it here in case someone find it useful (although that Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here’s the source. There is also a JavaScript version, using the same data.