How do I specify a range of unicode characters

Question:

How do I specify a range of Unicode characters from ' ' (space) to \u00D7FF?

I have a regular expression like r'[\u0020-\u00D7FF]', and it won’t compile, saying that it’s a bad range. I am new to Unicode regular expressions, so I haven’t had this problem before.

Is there a way to make this compile or a regular expression that I’m forgetting or haven’t learned yet?

Asked By: spig


Answers:

If you’re using Python 2.x, you should make sure you’re specifying a Unicode string (with u'' or the “unicode” built-in):

>>> import re
>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>

Using a raw string (as you are, with r'') gives you the (ASCII) string made up of a backslash, the letter “u”, the digit 0, and so on.
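
To see the difference concretely, here is a minimal sketch. It assumes Python 3, where every str literal is already Unicode and no u prefix is needed; the variable names are only illustrative.

```python
# Assumption: Python 3, where plain string literals are Unicode strings.
raw = r'\u0020'     # raw string: backslash, 'u', '0', '0', '2', '0'
cooked = '\u0020'   # normal string: the escape is decoded to one character

print(len(raw))      # six separate characters
print(len(cooked))   # a single character
print(cooked == ' ') # the decoded character is the space
```

The raw version hands the regex engine six literal characters instead of the single space character.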

Answered By: rbp

The syntax of your unicode range will not do what you expect.

  1. The raw r'' string prevents \u escapes from being parsed, and the regex engine will not parse them either. The only range in this set is [0-\u], which runs from “0” to “u”:

    >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102
    
  2. Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that’s not a concern here), but the extra leading zeroes break it. The syntax is \uxxxx or \Uxxxxxxxx, so it’s parsed as “\u00d7, f, f”.

    >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
    in
      range (32, 215)
      literal 102
      literal 102
    
  3. Removing the leading zeroes or switching to \U0000d7ff will fix it:

    >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
    in
      range (32, 55295)
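
The three steps above can be checked with a short script. This is only a sketch under the assumption of Python 3, where the ur'' prefix no longer exists and a plain string literal decodes \u escapes itself; the helper name `matches` is hypothetical and not part of the re module.

```python
import re

# Assumption: Python 3. The \u escapes are decoded by the string literal.
fixed = re.compile('[\u0020-\ud7ff]')      # one range: U+0020 to U+D7FF
pitfall = re.compile('[\u0020-\u00d7ff]')  # range U+0020 to U+00D7, then 'f', 'f'

def matches(pattern, ch):
    """Hypothetical helper: True if the single character ch is in the class."""
    return pattern.fullmatch(ch) is not None

print(matches(fixed, ' '))         # lower bound of the range
print(matches(fixed, '\ud7f0'))    # inside the intended range
print(matches(fixed, '\u0100'))    # inside the intended range
print(matches(pitfall, '\u0100'))  # outside the accidental U+0020..U+00D7 range
```

The last call shows the effect of the extra leading zeroes: the accidental class stops at U+00D7, so U+0100 falls outside it.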
    
Answered By: Josh Lee