Python and regular expressions with Unicode
Question:
I need to delete some Unicode symbols from the string ‘بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ’
I know they exist here for sure. I tried:
re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')
but it doesn’t work: the string stays the same. What am I doing wrong?
Answers:
Are you using Python 2.x or 3.x?
If you’re using 2.x, try making the regex string a Unicode string by prefixing it with ‘u’, so that the \uXXXX escapes are turned into the actual characters. Since it’s a regex, it’s also good practice to make it a raw string, with ‘r’. Also, putting your entire pattern in parentheses is superfluous.
re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)
http://docs.python.org/tutorial/introduction.html#unicode-strings
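If you are on Python 3 instead, the ‘ur’ prefix is a syntax error, but every str literal is already Unicode. A minimal sketch of the same substitution, assuming Python 3.3 or later (where the re module itself understands \uXXXX escapes in a raw pattern); the names DIACRITICS and strip_marks are made up for illustration:

```python
import re

# Same character class as in the question, without the superfluous group.
# The raw string passes \uXXXX through to re, which resolves the escapes.
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')


def strip_marks(text):
    """Return text with every run of characters in DIACRITICS removed."""
    return DIACRITICS.sub('', text)


# First word of the question's string, written with escapes so the
# individual code points are visible: beh, kasra, seen, sukun, meem, kasra.
sample = '\u0628\u0650\u0633\u0652\u0645\u0650'
stripped = strip_marks(sample)
print(len(sample), len(stripped))
```

The three base letters survive and the three marks are dropped, so the two lengths printed differ by three.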
Edit:
It’s also good practice to use the re.UNICODE/re.U/(?u) flag for Unicode regexes, but it only affects character class aliases like \w or \b. This pattern uses none of them, so the flag would not change its behavior.
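To see what the flag does and does not touch, here is a rough sketch in Python 3, where Unicode-aware matching is the default for str patterns and re.ASCII turns it off (roughly the situation in Python 2 without re.UNICODE). The variable names are only for illustration:

```python
import re

word = '\u0628\u0633\u0645'  # three Arabic letters, no diacritics

# Unicode-aware \w: each Arabic letter counts as a word character.
unicode_hits = re.findall(r'\w', word)

# ASCII-only \w: only [a-zA-Z0-9_] counts, so nothing matches here.
ascii_hits = re.findall(r'\w', word, re.ASCII)

# An explicit code point range does not depend on the flag at all.
explicit_default = re.findall(r'[\u0628-\u0645]', word)
explicit_ascii = re.findall(r'[\u0628-\u0645]', word, re.ASCII)

print(len(unicode_hits), len(ascii_hits))
print(explicit_default == explicit_ascii)
```

Only the \w results change with the flag; the explicit range matches the same three letters either way, which is why the pattern in the question is unaffected.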
Use unicode strings. Use the re.UNICODE flag.
>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+',
...                   re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم
Read the article by Joel Spolsky called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)