Python – remove punctuation marks at the end and at the beginning of one or more words

Question:

I want to know how to remove punctuation marks at the beginning and at the end of one or more words.
Punctuation marks inside a word should not be removed.

For example:

input:

word = "!.test-one,-"

output:

word = "test-one"

Asked By: Shrmn


Answers:

Use str.strip:

>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'
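
To cover the "one or more words" case from the title, the same call can be applied to each whitespace-separated token. This is only a minimal sketch, assuming the words are separated by whitespace; the helper name strip_words is made up for this illustration and is not part of the original answer:

```python
import string

def strip_words(text):
    # Strip leading/trailing punctuation from every whitespace-separated token.
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Tokens that consisted only of punctuation become empty and are dropped.
    return " ".join(token for token in stripped if token)

print(strip_words("!.test-one,- (hello) ...world!"))
```

Punctuation inside a token (the '-' in test-one) is left alone, because strip only looks at the two ends.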
Answered By: Epsi95

Using re.sub

import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)

Gives:

test-one
Answered By: Bhargav

Check this example using slice

import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."    
if sentence[0] in string.punctuation:
    sentence = sentence[1:]
if sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)

Output:

blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers
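
Note that the two if checks remove at most one character from each end, so an input such as "!.test-one,-" would keep part of its punctuation. A possible sketch that repeats the same slicing until both ends are clean (the function name strip_edges_by_slicing is hypothetical and not from the answer above):

```python
import string

def strip_edges_by_slicing(text, chars=string.punctuation):
    # Drop one leading character at a time while it is punctuation.
    while text and text[0] in chars:
        text = text[1:]
    # Same for the trailing end; the `text and` guard handles empty strings.
    while text and text[-1] in chars:
        text = text[:-1]
    return text

print(strip_edges_by_slicing("!.test-one,-"))
```

Each slice copies the string, so for long inputs the built-in strip is likely the better choice; the loop is mainly useful to see the idea.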
Answered By: Oghli

The best solution is to use the .strip(chars) method of Python's built-in str class.

Another approach is to use a regular expression with the re module.

To understand what strip() and the regular expression do, you can take a look at two functions which duplicate the behavior of strip(). The first one uses recursion, the second one uses while loops:


chars = r'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''

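def cstm_strip_1(word, chars):
    # Approach using recursion: each call drops at most one character per end.
    if not word:
        return word  # empty or all-punctuation input: nothing left to strip
    w = word[1 if word[0] in chars else 0: -1 if word[-1] in chars else None]
    if w == word:
        return w
    else: 
        return cstm_strip_1(w, chars)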

def cstm_strip_2(word, chars):
    # Approach using while loops: i is the first index to keep, j is one past the last.
    i, j = 0, len(word)
    while i < j and word[i] in chars:
        i += 1
    while j > i and word[j - 1] in chars:
        j -= 1
    return word[i:j]

import re, string

chars = string.punctuation
word = "~!.test-one^&test-one--two???"

wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)

word = "__~!.test-one^&test-one--two??__"

wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
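# The next assertion would fail: '_' is a word character, so the regex leaves this string unchanged.
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word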

print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc)
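# Tabs are non-word characters but not punctuation: the regex removes them, strip(chars) does not.
print('"', re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='')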

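Notice that the result of the given regular expression pattern can differ from the result of .strip(string.punctuation), because the set of characters matched by [^\w] differs from the set of characters in string.punctuation. For example, the underscore '_' is in string.punctuation but is also a word character, while whitespace such as a tab is a non-word character but is not in string.punctuation.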
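
If a regular expression is preferred but should behave like .strip(string.punctuation), one option is to build the character class from string.punctuation itself instead of relying on [^\w]. The following is a sketch under that assumption; the names EDGE_PUNCT and strip_punct_re are invented for this example:

```python
import re
import string

# re.escape protects characters that are special inside [...], such as ']', '\', '^' and '-'.
_PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"
EDGE_PUNCT = re.compile(rf"^{_PUNCT_CLASS}+|{_PUNCT_CLASS}+$")

def strip_punct_re(word):
    # Remove runs of punctuation anchored at the start or at the end only.
    return EDGE_PUNCT.sub("", word)

for sample in ("!.test-one,-", "__~!.test-one^&test-one--two??__", "\tword\t"):
    assert strip_punct_re(sample) == sample.strip(string.punctuation)
```

With this pattern the underscore is stripped and the tabs are kept, exactly as with strip.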

SUPPLEMENT

What does the regular expression pattern:

(^[^\w]+)|([^\w]+$)

mean?

Below a detailed explanation:

The '|' character means 'or', providing two alternatives for the 
   sub-string (called a match) to be found in the provided string. 

'(^[^\w]+)' is the first of the two alternatives for a match

    '(' ')' enclose what is called a "capturing group" (^[^\w]+)

    The first of the two '^' asserts position at the start of the string
      (or at the start of each line if re.MULTILINE is used)

    '\w' : 'w' escaped with a backslash means: "word character" 
        (i.e. letters a-z, A-Z, digits 0-9 and the underscore '_';
        for str patterns in Python 3 also non-ASCII letters and digits
        such as 'ö', unless the re.ASCII flag is used).

    The second of the two '^' means: logical "not" 
      (here not a "word character")
      i.e. all characters except a-zA-Z0-9 and '_'
        (for example '~', or 'ö' when re.ASCII is used)
      Notice that the meaning of '^' depends on context: 
        '^' outside of [ ] means start of line/string
        '^' inside  of [ ] as first char means logical not 
            and not as first char means itself 

    '[', ']' enclose specification of a set of characters 
      and mean the occurrence of exactly one of them

    '+' means occurrence between one and unlimited times
        of what was defined in preceding token

'([^\w]+$)' is the second alternative for a match 
    differing from the first by stating that the match
    should be found at the end of the string
    '$' means: "end of the line" (or "end of string")
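
The individual pieces described above can be checked in isolation. The snippet below is only an illustrative sketch; the sample strings and the names unicode_pat and ascii_pat are made up for the demonstration:

```python
import re

# '^' outside [ ] anchors the match at the start of the string.
assert re.findall(r"^[^\w]+", "!.ab!.") == ["!."]
# Without the anchor, the negated class [^\w] matches anywhere in the string.
assert re.findall(r"[^\w]+", "!.ab!.") == ["!.", "!."]
# '^' inside [ ] but not in first position is a literal '^'.
assert re.findall(r"[a^]+", "a^b^") == ["a^", "^"]

# \w is Unicode-aware for str patterns unless re.ASCII is passed.
unicode_pat = re.compile(r"(^[^\w]+)|([^\w]+$)")
ascii_pat = re.compile(r"(^[^\w]+)|([^\w]+$)", re.ASCII)
assert unicode_pat.sub("", "~öl~") == "öl"   # 'ö' counts as a word character
assert ascii_pat.sub("", "~öl~") == "l"      # 'ö' is treated as a non-word character
```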

The regular expression pattern tells the regular expression engine to work as follows:

The engine looks at the start of the string for an occurrence of a non-word
character. If one is found, it is remembered as a match, and the next
character is checked and added to the match if it is also
a non-word character. This way the start of the string is checked for
occurrences of non-word characters, which are then removed from the
string if the pattern is used in re.sub(r"(^[^\w]+)|([^\w]+$)", "", word),
which replaces every match with an empty string (in other words,
it deletes the matched characters from the string).

After the engine hits the first word character in the string, the search at
the start of the string stops. From then on only the second alternative
can match, and only at the end of the string, because the first
alternative is limited to the start of the line.

This way any non-word characters in the intermediate part of the string
never become part of a match and are therefore not removed.

The engine then looks at the end of the string for non-word characters
and proceeds as at the start, but it only accepts a run of non-word
characters that reaches all the way to the end of the string.
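
To see where the matches actually occur, the spans and capturing groups can be listed with re.finditer. A small sketch, with the helper name edge_matches chosen only for this example:

```python
import re

pattern = re.compile(r"(^[^\w]+)|([^\w]+$)")

def edge_matches(word):
    # For every match: (start, end) span, text of group 1, text of group 2.
    return [(m.span(), m.group(1), m.group(2)) for m in pattern.finditer(word)]

for span, leading, trailing in edge_matches("!.test-one,-"):
    print(span, repr(leading), repr(trailing))
```

For "!.test-one,-" this reports one match at the start (filled by the first group) and one at the end (filled by the second group); the '-' between "test" and "one" does not show up in any match.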

Answered By: Claudio