Complete set of punctuation marks for Python (not just ASCII)

Question:

Is there a list or library that covers all the punctuation characters we might commonly come across?

Normally I use string.punctuation, but some punctuation characters are not included in it, for example:

>>> import string
>>> "'" in string.punctuation
True
>>> "’" in string.punctuation
False
Asked By: samuelbrody1249


Answers:

You might do better with this check:

>>> import unicodedata
>>> unicodedata.category("'").startswith("P")
True
>>> unicodedata.category("’").startswith("P")
True

The Unicode categories P* are specifically for Punctuation:

connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
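
As a quick illustration of those seven subcategories, the sketch below looks up one sample character for each. The sample characters are arbitrary picks chosen for illustration; they are not part of the original answer.

```python
import unicodedata

# One illustrative character per punctuation subcategory
# (arbitrary samples, not an official list).
samples = {
    "_": "Pc",  # connector
    "-": "Pd",  # dash
    "«": "Pi",  # initial quote
    "»": "Pf",  # final quote
    "(": "Ps",  # open
    ")": "Pe",  # close
    "!": "Po",  # other
}

# Look up the actual category of each sample character.
categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(repr(ch), cat)
```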

To prepare the exhaustive collection, which you can subsequently use for fast membership checks, use a set comprehension:

>>> import sys
>>> from unicodedata import category
>>> codepoints = range(sys.maxunicode + 1)
>>> punctuation = {c for i in codepoints if category(c := chr(i)).startswith("P")}
>>> "'" in punctuation
True
>>> "’" in punctuation
True

The assignment expression here requires Python 3.8+. An equivalent for older Python versions:

chrs = (chr(i) for i in range(sys.maxunicode + 1))
punctuation = set(c for c in chrs if category(c).startswith("P"))
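
Once such a set exists, one possible use is stripping punctuation from text. The following is a minimal sketch, assuming str.translate is an acceptable way to do the removal; strip_table and strip_punctuation are hypothetical names introduced here.

```python
import sys
from unicodedata import category

# Build the set of all characters whose category starts with "P".
punctuation = {
    chr(i) for i in range(sys.maxunicode + 1)
    if category(chr(i)).startswith("P")
}

# str.translate deletes every code point that is mapped to None.
# (strip_table and strip_punctuation are hypothetical helper names.)
strip_table = dict.fromkeys(map(ord, punctuation))

def strip_punctuation(text):
    return text.translate(strip_table)

print(strip_punctuation("Don’t panic, “friend”!"))
```

Building the set scans every code point once, so it is worth doing a single time at import and reusing the result.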

Beware that some of the characters in string.punctuation (such as $, + and ~) are actually in the Unicode Symbol categories (S*), not Punctuation. It’s easy to add those in as well if you want.
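
To see which string.punctuation characters fall under Symbol rather than Punctuation, something like the sketch below could be used. The helper is_punct_or_symbol is a hypothetical name, and accepting every S* category is an assumption that may be broader than you want.

```python
import string
from unicodedata import category

# Characters in string.punctuation that Unicode classifies as
# Symbol (S*) rather than Punctuation (P*).
ascii_symbols = sorted(c for c in string.punctuation if category(c).startswith("S"))
print("".join(ascii_symbols))

def is_punct_or_symbol(ch):
    # Hypothetical helper: accept both the P* and the S* categories.
    return category(ch)[0] in "PS"
```

Note that S* is a wide net: it also takes in currency signs, math operators and many pictographic symbols.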

Answered By: wim

The answer posted by wim is correct if you want to check if a character is a punctuation character.

If you really need a list of all punctuation characters as your question title suggests, you can use the following:

import sys
from unicodedata import category
# range() excludes its end point, so add 1 to include the last code point
punctuation_chars = [chr(i) for i in range(sys.maxunicode + 1)
                     if category(chr(i)).startswith("P")]
Answered By: Selcuk

The answer by wim is great if you can change your code to use a function.

But if you have to use the in operator (for example, you’re calling into library code), you can use duck typing:

import unicodedata

class DuckType:
    # Supports "x in punct" without building a set of every character.
    def __contains__(self, s):
        return unicodedata.category(s).startswith("P")

punct = DuckType()
# print("'" in punct, '"' in punct, "a" in punct)  # True True False
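
One caveat with this approach: unicodedata.category() raises TypeError unless it is given exactly one character, whereas a real set would simply report False. A minimal sketch of a more forgiving variant follows; PunctuationContainer is a hypothetical name, not part of the answer above.

```python
import unicodedata

class PunctuationContainer:
    """Hypothetical variant that tolerates input that is not one character."""

    def __contains__(self, s):
        # Mimic set behaviour: anything that is not a single-character
        # string is simply "not in" the container.
        if not isinstance(s, str) or len(s) != 1:
            return False
        return unicodedata.category(s).startswith("P")

punct = PunctuationContainer()
print("’" in punct, "a" in punct, "!?" in punct)
```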
Answered By: xkcdjerry

That seems like a pretty good job for a regular expression (regexp):

    import re
    text = re.sub(r"[^\w\s]", "", str(text), flags=re.UNICODE)

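Here, the regexp matches every character that is neither whitespace (\s) nor a word character (\w) and replaces it with an empty string. The re.UNICODE flag makes \w and \s cover the full Unicode range; in Python 3 that is already the default for str patterns, so the flag is optional there.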
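
It may help to see what this pattern does and does not remove. In the sketch below, strip_non_word is a hypothetical wrapper name. Because the pattern is defined by exclusion, it also removes symbols such as $, while the underscore survives because \w treats it as a word character.

```python
import re

def strip_non_word(text):
    # Hypothetical helper wrapping the substitution above. In Python 3,
    # str patterns are Unicode-aware by default, so re.UNICODE is omitted.
    return re.sub(r"[^\w\s]", "", text)

print(strip_non_word("Hello, world’s “test”!"))
print(strip_non_word("snake_case costs $5"))
```

So this is not quite the same as removing Unicode punctuation: the underscore is category Pc but is kept, and $ is category Sc but is dropped.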

Answered By: Nicolas Martinez

As other answers have pointed out, the way to do this is via Unicode properties/categories. The accepted answer accesses this information via the standard library unicodedata module, but depending on the context where you need this, it might be faster or more convenient to access this same property information using regular expressions.

However, the standard library re module does not provide extended Unicode support. For that, you need the regex module, available on PyPI (pip install regex):

>>> import regex as re
>>> re.match(r"\p{Punctuation}", "'")
<regex.Match object; span=(0, 1), match="'">
>>> re.match(r"\p{Punctuation}", "’")
<regex.Match object; span=(0, 1), match='’'>

A good overview of all the different kinds of Unicode properties you can search for using regular expressions is provided here. Apart from these extra regular expression features, which are documented on its PyPI homepage, regex deliberately provides the same API as re, so you’re expected to use re’s documentation to figure out how to use either of them.

Answered By: dlukes