Python regex, remove all punctuation except hyphen for unicode string
Question:
I have this code for removing all punctuation from a unicode string:
import regex as re
re.sub(ur"\p{P}+", "", txt)
How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I’m wrong, P with anything after it is punctuation.
Answers:
You could either specify the punctuation you want to remove manually, as in [._,], or supply a function instead of the replacement string:
re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
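For readers without the third-party regex module, here is a minimal sketch of the same callable-replacement idea using only the standard re module. It assumes [^\w\s] is an acceptable stand-in for \p{P}, which is not exact, since that class also matches symbols such as "$" and "+". The names PUNCT_LIKE, keep_hyphen and strip_punct_keep_hyphen are made up for this illustration.

```python
import re

# Assumption: [^\w\s] stands in for \p{P}, because the standard-library re
# module has no \p{...} classes. It also matches symbols such as "$" and "+".
PUNCT_LIKE = re.compile(r"[^\w\s]")


def keep_hyphen(match):
    """Return a hyphen unchanged; replace any other match with nothing."""
    return "-" if match.group(0) == "-" else ""


def strip_punct_keep_hyphen(text):
    # re.sub calls keep_hyphen once per match and inserts its return value.
    return PUNCT_LIKE.sub(keep_hyphen, text)


print(strip_punct_keep_hyphen("well-known, isn't it?"))  # well-known isnt it
```

The function is called once for every match, so the hyphen is matched and then put straight back, while every other matched character is replaced by the empty string.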
[^\P{P}-]+
\P is the complement of \p, that is, "not punctuation". So this matches any character that is not (not punctuation or a dash), which leaves all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
If you want a non-convoluted way, an alternative is \p{P}(?<!-): match all punctuation, and then check it wasn’t a dash (using negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
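Both patterns boil down to the same test: a character is removed if it is punctuation and is not a hyphen. As a rough illustration of that logic without any regex engine, the sketch below uses the standard unicodedata module and assumes that \p{P} corresponds to the Unicode general categories beginning with "P". The function names are hypothetical.

```python
import unicodedata


def is_punctuation(ch):
    # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P";
    # this is assumed here to be the set that \p{P} refers to.
    return unicodedata.category(ch).startswith("P")


def remove_punct_except_hyphen(text):
    # Keep a character unless it is punctuation AND not the ASCII hyphen.
    return "".join(ch for ch in text if not is_punctuation(ch) or ch == "-")


print(remove_punct_except_hyphen("«semi-formal», he said… $5!"))
# semi-formal he said $5
```

Note that "$" survives because it is a currency symbol (category Sc), not punctuation, and that only the ASCII hyphen is spared: an en dash (category Pd) would still be removed.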
Here’s how to do it with the re module, in case you have to stick with the standard libraries:
# works in python 2 and 3
import re
import string
remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; escape so \ ] ^ are literal in the class
txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'
If performance matters, you may want to use str.translate, since it’s faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
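Put together as a runnable Python 3 snippet, the str.translate variant might look like the sketch below. It reuses the sample text from the re example; the variable name table is an addition for illustration only.

```python
import string

# Every ASCII punctuation character except the hyphen.
remove = string.punctuation.replace("-", "")

# Map each code point to None so that str.translate deletes it.
table = {ord(char): None for char in remove}

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
print(txt.translate(table))  # this - is - a - test
```

Like the re version, this only covers the ASCII characters in string.punctuation, so non-ASCII punctuation such as curly quotes is left in place.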