How to check if a string in Python is in ASCII?

Question

I want to check whether a string is in ASCII or not.

I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood that this is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?

Asked By: Nico

Answer 1

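You could use a regular expression library that accepts the POSIX [[:ASCII:]] character class. Note that Python's built-in re module does not support POSIX bracket classes, so this requires a different regex engine.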

Answered By: Steve Moyer

Answer 2

I think you are not asking the right question.

A string in Python has no property corresponding to 'ascii', 'utf-8', or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ASCII to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ASCII?" This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"
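
In Python 3 the same idea applies to bytes objects rather than str. A minimal sketch, assuming the data arrives as bytes (the function name bytes_are_ascii is made up for illustration):

```python
def bytes_are_ascii(data: bytes) -> bool:
    """Return True if every byte in data is a valid ASCII byte (0-127)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(bytes_are_ascii(b"hello"))         # True
print(bytes_are_ascii(b"\xc3\xa9"))      # False: UTF-8 bytes for 'é'
```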

Answered By: Vincent Marchetti

Answer 3

def is_ascii(s):
    return all(ord(c) < 128 for c in s)
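
A short usage sketch for the function above, written for Python 3 where str is a sequence of code points. Note that an empty string counts as ASCII, because all() over an empty sequence is True:

```python
def is_ascii(s):
    # Every code point must fall in the 7-bit ASCII range 0-127.
    return all(ord(c) < 128 for c in s)


print(is_ascii("Python"))     # True
print(is_ascii("caf\u00e9"))  # False: U+00E9 is outside ASCII
print(is_ascii(""))           # True: all() of an empty sequence is True
```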

Answered By: Alexander Kojevnikov

Answer 4

How about doing this?

import string

def isAscii(s):
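    for c in s:
        # Caveat: string.ascii_letters holds only A-Z and a-z, so digits,
        # punctuation, and whitespace are also reported as non-ASCII.
        if c not in string.ascii_letters: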
            return False
    return True
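
For comparison, here is a sketch of the same membership approach widened to all 128 ASCII characters. The names is_ascii_letters, is_ascii_any, and ASCII_CHARS are made up for this example:

```python
import string

# All 128 ASCII characters, including digits, punctuation, and control codes.
ASCII_CHARS = frozenset(map(chr, range(128)))


def is_ascii_letters(s):
    # Letters-only test: passes only A-Z and a-z.
    return all(c in string.ascii_letters for c in s)


def is_ascii_any(s):
    # Full-range test: passes any character with a code point below 128.
    return all(c in ASCII_CHARS for c in s)


print(is_ascii_letters("abc 123"))  # False: space and digits are not letters
print(is_ascii_any("abc 123"))      # True
print(is_ascii_any("\u00e9"))       # False
```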

Answered By: miya

Answer 5

Your question is incorrect; the error you see is not a result of how you built Python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo" or 'bar' in Python syntax) are sequences of octets: numbers from 0 to 255. Unicode strings (e.g. u"foo" or u'bar') are sequences of Unicode code points: numbers from 0 to 1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented by either a single or multiple Unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent" (i.e., code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
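
A small sketch of the composed and decomposed forms, written for Python 3 (where every str is a unicode string). The variable names are just for illustration:

```python
import unicodedata

composed = "\u00e9"     # single code point: e with acute accent
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))  # 1 2

# NFC composes where possible; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```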

Answered By: Glyph

Answer 6

A string (str type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
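
As a rough sketch of that approach in Python 3, assuming the bytes are known to be ISO-8859-1 (the helper name non_ascii_chars is hypothetical):

```python
def non_ascii_chars(data: bytes, encoding: str) -> list:
    """Decode data with a known encoding and list characters above U+007F."""
    text = data.decode(encoding)
    return [c for c in text if ord(c) > 127]


raw = "na\u00efve".encode("iso-8859-1")    # bytes from a Latin-1 source
print(non_ascii_chars(raw, "iso-8859-1"))  # ['ï']
print(non_ascii_chars(b"plain", "utf-8"))  # []
```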

Answered By: JacquesB

Answer 7

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False

Answered By: mvknowles

Answer 8

Ran into something like this recently – for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

Answered By: Alvin

Answer 9

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you’re getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
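
For reference, a Python 3 sketch of the same escaping. There str is already unicode, so no decode step is needed before encoding:

```python
specials = "\u00e0\u00e9\u00e7"  # 'àéç'

# Characters that cannot be encoded as ASCII become XML character references.
escaped = specials.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # &#224;&#233;&#231;
```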

Answered By: Max P Magee

Answer 10

To prevent your code from crashing, you may want to use a try-except block to catch the TypeError:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

Answered By: user2489252

Answer 11

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
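
The length comparison works because UTF-8 encodes each ASCII character as exactly one byte and every other code point as two to four bytes. A quick illustration (the sample characters are arbitrary):

```python
samples = ["A", "\u00e9", "\u2665", "\U0001F600"]  # A, é, ♥, 😀

# Number of UTF-8 bytes needed for each single-character string.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```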

Answered By: far

Answer 12

To improve on Alexander's solution, from Python 2.6 onward (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

Answered By: Sergey Nevmerzhitsky

Answer 13

Vincent Marchetti has the right idea, but str.decode no longer exists in Python 3. In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
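
Wrapped up as a reusable function, this might look like the following sketch (the name is_ascii is my choice, not part of any library):

```python
def is_ascii(s: str) -> bool:
    """Return True if s can be encoded as ASCII without errors."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("plain text"))  # True
print(is_ascii("na\u00efve"))  # False
```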

Answered By: drs

Answer 14

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
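
In Python 3.4 and later, re.fullmatch can replace the trailing $ anchor. A sketch of that variant, here using * so the empty string counts as ASCII (the name _ASCII_ONLY is made up):

```python
import re

# Precompiled pattern: zero or more characters in the range U+0000-U+007F.
_ASCII_ONLY = re.compile(r"[\x00-\x7F]*")


def is_ascii(s):
    # fullmatch succeeds only if the whole string matches the pattern.
    return _ASCII_ONLY.fullmatch(s) is not None


print(is_ascii("abc\n"))   # True
print(is_ascii(""))        # True
print(is_ascii("\u00e9"))  # False
```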

Answered By: Roger Dahl

Answer 15

Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.
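
A side benefit of search is that the match object says where the first offending character is. A small sketch (the helper name first_non_ascii is hypothetical):

```python
import re

# Matches any single character outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7F]")


def first_non_ascii(s):
    """Return the index of the first non-ASCII character, or None."""
    match = _NON_ASCII.search(s)
    return None if match is None else match.start()


print(first_non_ascii("plain text"))  # None
print(first_non_ascii("na\u00efve"))  # 2
```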

Answered By: hobs

Answer 16

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True
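
A few more cases, plus a sketch of a fallback for interpreters older than 3.7. The wrapper name is_ascii is made up, and the fallback branch handles str only:

```python
def is_ascii(s):
    try:
        return s.isascii()  # available on str/bytes/bytearray in Python 3.7+
    except AttributeError:
        # Fallback for older Python 3 versions (str input only).
        return all(ord(c) < 128 for c in s)


print(is_ascii("caf\u00e9"))       # False
print(is_ascii(""))                # True: the empty string counts as ASCII
print(is_ascii(b"raw bytes"))      # True
print(is_ascii(bytes([0x80])))     # False
```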

Answered By: Taku
