How do I specify a range of unicode characters
Question:
How do I specify a range of Unicode characters from ' ' (space) to \u00D7FF?
I have a regular expression like r'[\u0020-\u00D7FF]' and it won’t compile, saying that it’s a bad range. I am new to Unicode regular expressions, so I haven’t had this problem before.
Is there a way to make this compile or a regular expression that I’m forgetting or haven’t learned yet?
Answers:
If you’re using Python 2.x, you should make sure you’re specifying a unicode string (with the u'' prefix, or the “unicode” built-in):
>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>
Using raw strings (as you are, with r'') gives you the (ASCII) string composed of a backslash, the letter “u”, the digit 0, and so on…
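For comparison, here is a minimal sketch of the same idea on Python 3, which is an assumption on my part since the answer above targets Python 2.x. In Python 3 every str literal is already Unicode, so no u'' prefix is needed. The names pattern, cooked and raw are only illustrative.

```python
import re

# Non-raw literal: Python converts \u0020 and \uD7FF to single characters
# before the regex engine sees the pattern.
pattern = re.compile('[\u0020-\uD7FF]')

# Python 3 versions of the two searches from the answer.
print(pattern.search('foo \uD7F0 bar') is not None)
print(pattern.search(' ') is not None)

# The same escape written as a normal literal and as a raw literal.
cooked = '\u0020'   # one character: a space
raw = r'\u0020'     # six characters: backslash, u, 0, 0, 2, 0
print(len(cooked), len(raw))
print(list(raw))
```

The length difference is the whole point: a raw literal hands the backslash sequence to the regex engine untouched, so what happens next depends on whether that engine understands \u itself.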
The syntax of your unicode range will not do what you expect.
- The raw r'' string prevents \u escapes from being parsed by Python, and the regex engine (in Python 2) will not do this either. The only range in this set is 0-\u, which the engine reads as '0' (48) through 'u' (117):
>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
literal 117
literal 48
literal 48
literal 50
range (48, 117)
literal 48
literal 48
literal 100
literal 55
literal 102
literal 102
- Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that’s not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so \u00d7ff is parsed as “\u00d7”, “f”, “f”:
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
range (32, 215)
literal 102
literal 102
- Removing the leading zeroes or switching to \U0000d7ff will fix it:
>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
range (32, 55295)
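The three cases above can be checked without reading DEBUG output. This is a rough sketch that assumes Python 3, not the Python 2 sessions shown in the answer. Since Python 3.3 the re module parses \uXXXX itself, so the first case is emulated by dropping the backslashes, which is what the Python 2 engine effectively did. The names degraded, too_long and fixed are illustrative.

```python
import re

# Case 1: what Python 2 effectively compiled from the raw string.
# With "\u" reduced to a plain "u", the class is [u0020-u00d7ff], whose only
# range is "0-u" (code points 48 to 117).
degraded = re.compile('[u0020-u00d7ff]')
print(bool(degraded.fullmatch('A')))   # 'A' is 65, inside 48..117
print(bool(degraded.fullmatch(' ')))   # space is 32, not in the class

# Case 2: leading zeroes. "\u" consumes exactly four hex digits, so this
# literal is three characters: U+00D7 followed by "f" and "f".
too_long = '\u00d7ff'
print([hex(ord(c)) for c in too_long])

# Case 3: both corrected spellings denote the same single character.
print('\ud7ff' == '\U0000d7ff', ord('\ud7ff'))

fixed = re.compile('[\u0020-\ud7ff]')
print(bool(fixed.fullmatch('\ud7ff')))  # upper bound of the range
print(bool(fixed.fullmatch('\ue000')))  # above U+D7FF, so no match
```

Checking ord() on the endpoints is a quick way to confirm a range before compiling it: 0xD7FF is 55295, which agrees with the range (32, 55295) in the last DEBUG output.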