How to detect non-ASCII characters in Python?
Question:
I’m parsing multiple XML files with Python 2.7, and some of them contain strings like string = "[2,3,13,37–41,43,44,46]". I split them to get a list of all elements, and then I have to detect elements with “–” like “37–41”, but it turns out this is not a regular dash, it’s a non-ASCII character:
elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
So I need something like
for e in elements:
    if "–" in e:
        # do something about it
If I use that non-ASCII char in this if expression, I get an error: "SyntaxError: Non-ASCII character '\xe2' in file...".
I tried to replace the if expression with this re method:
re.search('\xe2', e)
but that doesn’t work either. So I’m looking for a way to either convert that non-ASCII char to a regular ASCII “-” or use the ASCII number directly in the search expression.
Answers:
Give this a try:
>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>>
>>> string ="[2,3,13,37–41,43,44,46]"
>>> new_str = string.replace("[","")
>>> new_str = new_str.replace("]","")
>>> lst = new_str.split(",")
>>> for element in lst:
...     if element.isdigit():
...         print element
...     else:
...         toexpand = non_decimal.sub('f', str(element))
...         toexpand = toexpand.split("f")
...         for i in range(int(toexpand[0]),int(toexpand[1])+1,1):
...             print i
...
2
3
13
37
38
39
40
41
43
44
46
>>>
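The same expansion can also be written as a small function that works on the unicode strings from the question directly, which avoids the str() call (in Python 2, str() on a unicode string containing the en dash raises UnicodeEncodeError). This is only a sketch of one possible variation; the name expand_ranges is made up for illustration, and it assumes every element is either a single number or two numbers separated by some non-digit character.

```python
import re

def expand_ranges(text):
    """Expand a string like u'[2,3,37\u201341]' into a flat list of ints.

    Illustrative helper, not part of the original answer.
    """
    numbers = []
    for part in text.strip('[]').split(','):
        # Pull out the digit runs; whatever separates them (hyphen, en dash) is ignored.
        bounds = [int(n) for n in re.findall(r'\d+', part)]
        if not bounds:
            continue  # skip empty or non-numeric parts
        # A single number expands to itself; a pair expands to the inclusive range.
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers

expanded = expand_ranges(u'[2,3,13,37\u201341,43,44,46]')
```

Because the separator is never named explicitly, the function does not care whether the dash is ASCII or not.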
You have to declare the source encoding on the first or second line of your Python file, for example:
# -*- coding: utf-8 -*-
Usually Python tells you about this issue:
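SyntaxError: Non-ASCII character '\xe2' in file ./fail.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
After adding the encoding declaration, the SyntaxError goes away. Because your elements are unicode strings, also write the literal as u"–", so that Python 2 does not try to decode a byte string as ASCII during the comparison.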
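An alternative that keeps non-ASCII characters out of the source file altogether is to write the en dash as the escape u'\u2013', which is the code point visible in the question's list. A minimal sketch, assuming the goal is to find the affected elements and swap in a plain ASCII hyphen; EN_DASH and normalize_dashes are illustrative names, not anything from the question.

```python
# Sketch only: EN_DASH and normalize_dashes are illustrative names.
EN_DASH = u'\u2013'  # the character inside u'37\u201341', written as an escape

def normalize_dashes(elements):
    """Return a copy of elements with every en dash replaced by ASCII '-'."""
    return [e.replace(EN_DASH, u'-') for e in elements]

elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
ranges = [e for e in elements if EN_DASH in e]  # elements that contain the en dash
cleaned = normalize_dashes(elements)
```

Since the source contains only ASCII, this runs with or without an encoding declaration.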
# -*- coding: utf-8 -*-
import re
elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
for e in elements:
    if (re.sub('[ -~]', '', e)) != "":
        #do something here
        print "-"
re.sub('[ -~]', '', e) strips every printable ASCII character (space through ~) from e by replacing it with "", so only the other characters of e, including any non-ASCII ones, remain.
Hope this helps.
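Note that the class [ -~] covers printable ASCII only, so a tab or newline would also be left over. If that matters, one option is to search for the complement of the full 7-bit range instead. A minimal sketch under that assumption; NON_ASCII and has_non_ascii are hypothetical names.

```python
import re

# Matches any single character outside the 7-bit ASCII range (0x00 to 0x7f).
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def has_non_ascii(text):
    """Return True if text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None

elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
flagged = [e for e in elements if has_non_ascii(e)]
```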
You can check whether the character value is between 0 and 127.
for c in someString:
    if 0 <= ord(c) <= 127:
        # this is an ASCII character.
        pass
    else:
        # this is a non-ASCII character. Do something.
        pass
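The ord() check can be wrapped in a helper that also reports where the offending characters are, which helps when tracking down something like the en dash in the question. This is a sketch; non_ascii_positions is a made-up name.

```python
def non_ascii_positions(text):
    """Return (index, character, code point in hex) for each non-ASCII character."""
    return [(i, c, hex(ord(c))) for i, c in enumerate(text) if ord(c) > 127]

found = non_ascii_positions(u'37\u201341')
```

An empty list means the string is pure ASCII.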
This may not answer your whole question; it is very simple and not flexible, but it is what I do whenever I hit this error.
I usually open up an interactive Python shell and type in:
print [ln for ln in open("filename.py", "rb").readlines() if "\xe2" in ln]
That gives you the lines containing \xe2. Then find them in your editor and remove the character.
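A slightly more general version of the same idea reports the line number and catches any non-ASCII byte, not just \xe2. The sketch below reads the file in binary mode and uses a bytes regex; the names HIGH_BYTE and lines_with_non_ascii are illustrative, and "filename.py" in the usage note stands in for your own file.

```python
import re

HIGH_BYTE = re.compile(b'[\x80-\xff]')  # any byte outside 7-bit ASCII

def lines_with_non_ascii(path):
    """Return (line number, raw line) pairs for lines containing non-ASCII bytes."""
    hits = []
    with open(path, 'rb') as handle:
        for number, line in enumerate(handle, start=1):
            if HIGH_BYTE.search(line):
                hits.append((number, line))
    return hits
```

Calling lines_with_non_ascii("filename.py") then gives the lines to look at in the editor.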
For a more generalized approach you can use libraries such as chardet (pure Python) or cchardet (a faster C alternative). Install the module with, e.g., pip3 install chardet or pip3 install cchardet.
They both have the same API:
import chardet
with open("/etc/hosts", "rb") as fi:
    print(chardet.detect(fi.read()))
# OUTPUT:
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
If the encoding key in the dict is not ascii, then you have non-ASCII characters in the file.
Both modules expose command line tools that you can use to detect which of your XML files are non-ASCII:
find . -iname "*.xml" -exec cchardetect {} +
Normally, you would only need to do the detection when working with mysterious legacy data of unknown origin that is not utf-8/unicode.
If you have a hard requirement to convert everything to ASCII then you can do something like:
import unicodedata
unicodedata.normalize('NFKD', 'Verhältnismäßigkeit — 1').encode('ascii', 'ignore')
# OUTPUT
# b'Verhaltnismaigkeit  1'
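As the output shows, characters with no ASCII decomposition (the ß and the dash) are silently dropped by 'ignore'. For the case in the question, where the dash carries meaning, one option is to map dashes to an ASCII hyphen before encoding. A minimal sketch under that assumption; the DASHES table and the to_ascii name are hypothetical, and the table lists only the en dash and em dash.

```python
import unicodedata

# Hypothetical mapping for illustration: en dash and em dash to ASCII hyphen.
DASHES = {0x2013: u'-', 0x2014: u'-'}

def to_ascii(text):
    """Replace known dashes, decompose accents, then drop what is still non-ASCII."""
    text = text.translate(DASHES)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

converted = to_ascii(u'37\u201341')
```

Other characters that should survive the conversion would need their own entries in the table.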