How do I specify a range of unicode characters

Question

How do I specify a range of Unicode characters from ' ' (space) to \u00D7FF?

I have a regular expression like r'[\u0020-\u00D7FF]' and it won’t compile, saying that it’s a bad range. I am new to Unicode regular expressions, so I haven’t had this problem before.

Is there a way to make this compile or a regular expression that I’m forgetting or haven’t learned yet?

Asked By: spig


Answer 1

If you’re using Python 2.x, you should make sure you’re specifying a unicode string (with u'', or the “unicode” built-in):

>>> import re
>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>

Using raw strings (as you are, with r'') gives you the (ASCII) string composed of a backslash + the letter “u” + the digit 0 plus…
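
To make the point about raw strings concrete, here is a minimal sketch. It assumes Python 3, where every string literal is already Unicode (so no u'' prefix is needed); the variable names are purely illustrative.

```python
# Python 3 sketch: compare a raw literal with a regular one.
raw = r'\u0020'     # raw literal: the \u escape is NOT processed
cooked = '\u0020'   # regular literal: the \u escape IS processed

print(len(raw))       # 6
print(list(raw))      # ['\\', 'u', '0', '0', '2', '0']
print(len(cooked))    # 1
print(cooked == ' ')  # True
```

The raw literal holds six separate characters, while the regular literal holds the single space character.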

Answered By: rbp

Answer 2

The syntax of your unicode range will not do what you expect.

The raw r'' string prevents \u escapes from being parsed, and the regex engine will not do this. The only range in this set is [0-\u], that is, from “0” to “u”:

>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
  literal 117
  literal 48
  literal 48
  literal 50
  range (48, 117)
  literal 48
  literal 48
  literal 100
  literal 55
  literal 102
  literal 102

Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that’s not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so it’s parsed as “\u00d7, f, f”.
```
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
```
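
The fixed-width rule can also be checked on a plain string, without involving the regex engine. This is a small sketch assuming Python 3 string literals; the names `s` and `wide` are just for illustration.

```python
# \u consumes exactly four hex digits; anything after that is ordinary text.
s = '\u00d7ff'
print(len(s))                     # 3
print([hex(ord(c)) for c in s])   # ['0xd7', '0x66', '0x66']

# \U consumes exactly eight hex digits, so the leading zeroes are required here.
wide = '\U0000d7ff'
print(len(wide), hex(ord(wide)))  # 1 0xd7ff
```

The code points 0xd7, 0x66, 0x66 are the same 215, 102, 102 that appear in the DEBUG output.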

Removing the leading zeroes or switching to \U0000d7ff will fix it:

>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
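
For readers on Python 3, a rough equivalent of the corrected pattern might look like the sketch below. It assumes Python 3.4 or later, where the re module itself understands \uXXXX inside a pattern (added in 3.3) and `fullmatch` is available; `in_range` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: Python 3.4+. The raw literal passes \u0020 and \uD7FF to the
# regex engine untouched, and the engine interprets them itself.
pattern = re.compile(r'[\u0020-\uD7FF]')

def in_range(ch):
    """Return True if the single character ch lies in U+0020..U+D7FF."""
    return pattern.fullmatch(ch) is not None

print(in_range(' '))       # True  (U+0020, lower bound)
print(in_range('\ud7f0'))  # True  (inside the range)
print(in_range('\x1f'))    # False (just below the range)
print(in_range('\ue000'))  # False (above the range)
```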

Answered By: Josh Lee
