Delete digits in Python (Regex)
Question:
I’m trying to delete all digits from a string.
However the next code deletes as well digits contained in any word, and obviously I don’t want that.
I’ve been trying many regular expressions with no success.
Thanks!
s = "This must not b3 delet3d, but the number at the end yes 134411"
s = re.sub("d+", "", s)
print s
Result:
This must not b deletd, but the number at the end yes
Answers:
Add a space before the d+.
>>> s = "This must not b3 delet3d, but the number at the end yes 134411"
>>> s = re.sub(" d+", " ", s)
>>> s
'This must not b3 delet3d, but the number at the end yes '
Edit: After looking at the comments, I decided to form a more complete answer. I think this accounts for all the cases.
s = re.sub("^d+s|sd+s|sd+$", " ", s)
If your number is allways at the end of your strings try :
re.sub("d+$", "", s)
otherwise, you may try
re.sub("(s)d+(s)", "12", s)
You can adjust the back-references to keep only one or two of the spaces (s
match any white separator)
Try this:
"bd+b"
That’ll match only those digits that are not part of another word.
To handle digit strings at the beginning of a line as well:
s = re.sub(r"(^|W)d+", "", s)
Using s
isn’t very good, since it doesn’t handle tabs, et al. A first cut at a better solution is:
re.sub(r"bd+b", "", s)
Note that the pattern is a raw string because b
is normally the backspace escape for strings, and we want the special word boundary regex escape instead. A slightly fancier version is:
re.sub(r"$d+W+|bd+b|W+d+$", "", s)
That tries to remove leading/trailing whitespace when there are digits at the beginning/end of the string. I say “tries” because if there are multiple numbers at the end then you still have some spaces.
Non-regex solution:
>>> s = "This must not b3 delet3d, but the number at the end yes 134411"
>>> " ".join([x for x in s.split(" ") if not x.isdigit()])
'This must not b3 delet3d, but the number at the end yes'
Splits by " "
, and checks if the chunk is a number by doing str().isdigit()
, then joins them back together. More verbosely (not using a list comprehension):
words = s.split(" ")
non_digits = []
for word in words:
if not word.isdigit():
non_digits.append(word)
" ".join(non_digits)
I don’t know what your real situation looks like, but most of the answers look like they won’t handle negative numbers or decimals,
re.sub(r"(b|s+-?|^-?)(d+|d*.d+)b","")
The above should also handle things like,
“This must not b3 delet3d, but the number at the end yes -134.411”
But this is still incomplete – you probably need a more complete definition of what you can expect to find in the files you need to parse.
Edit: it’s also worth noting that ‘b’ changes depending on the locale/character set you are using so you need to be a little careful with that.
>>>s = "This must not b3 delet3d, but the number at the end yes 134411"
>>>s = re.sub(r"d*$", "", s)
>>>s
“This must not b3 delet3d, but the number at the end yes “
This will remove the numericals at the end of the string.
You could try this
s = "This must not b3 delet3d, but the number at the end yes 134411"
re.sub("(sd+)","",s)
result:
'This must not b3 delet3d, but the number at the end yes'
the same rule also applies to
s = "This must not b3 delet3d, 4566 but the number at the end yes 134411"
re.sub("(sd+)","",s)
result:
'This must not b3 delet3d, but the number at the end yes'
To match only pure integers in a string:
b(?<![0-9-])(d+)(?![0-9-])b
It does the right thing with this, matching only everything after million:
max-3 cvd-19 agent-007 8-zoo 2ab c3d ef4 55g h66i jk77
8m9n o0p2 million 0 22 333 4444
All of the other 8 regex answers on this page fail in various ways with that input.
The dash at the end by that first 0-9 … [0-9-] … preserves -007 and the dash in the second set preserves 8-.
Or d in place of 0-9 if you prefer
Can it be simplified?
I had a light-bulb moment, I tried and it works:
sol = re.sub(r'[~^0-9]', '', 'aas30dsa20')
output:
aasdsa
I’m trying to delete all digits from a string.
However the next code deletes as well digits contained in any word, and obviously I don’t want that.
I’ve been trying many regular expressions with no success.
Thanks!
s = "This must not b3 delet3d, but the number at the end yes 134411"
s = re.sub("d+", "", s)
print s
Result:
This must not b deletd, but the number at the end yes
Add a space before the d+.
>>> s = "This must not b3 delet3d, but the number at the end yes 134411"
>>> s = re.sub(" d+", " ", s)
>>> s
'This must not b3 delet3d, but the number at the end yes '
Edit: After looking at the comments, I decided to form a more complete answer. I think this accounts for all the cases.
s = re.sub("^d+s|sd+s|sd+$", " ", s)
If your number is allways at the end of your strings try :
re.sub("d+$", "", s)
otherwise, you may try
re.sub("(s)d+(s)", "12", s)
You can adjust the back-references to keep only one or two of the spaces (s
match any white separator)
Try this:
"bd+b"
That’ll match only those digits that are not part of another word.
To handle digit strings at the beginning of a line as well:
s = re.sub(r"(^|W)d+", "", s)
Using s
isn’t very good, since it doesn’t handle tabs, et al. A first cut at a better solution is:
re.sub(r"bd+b", "", s)
Note that the pattern is a raw string because b
is normally the backspace escape for strings, and we want the special word boundary regex escape instead. A slightly fancier version is:
re.sub(r"$d+W+|bd+b|W+d+$", "", s)
That tries to remove leading/trailing whitespace when there are digits at the beginning/end of the string. I say “tries” because if there are multiple numbers at the end then you still have some spaces.
Non-regex solution:
>>> s = "This must not b3 delet3d, but the number at the end yes 134411"
>>> " ".join([x for x in s.split(" ") if not x.isdigit()])
'This must not b3 delet3d, but the number at the end yes'
Splits by " "
, and checks if the chunk is a number by doing str().isdigit()
, then joins them back together. More verbosely (not using a list comprehension):
words = s.split(" ")
non_digits = []
for word in words:
if not word.isdigit():
non_digits.append(word)
" ".join(non_digits)
I don’t know what your real situation looks like, but most of the answers look like they won’t handle negative numbers or decimals,
re.sub(r"(b|s+-?|^-?)(d+|d*.d+)b","")
The above should also handle things like,
“This must not b3 delet3d, but the number at the end yes -134.411”
But this is still incomplete – you probably need a more complete definition of what you can expect to find in the files you need to parse.
Edit: it’s also worth noting that ‘b’ changes depending on the locale/character set you are using so you need to be a little careful with that.
>>>s = "This must not b3 delet3d, but the number at the end yes 134411"
>>>s = re.sub(r"d*$", "", s)
>>>s
“This must not b3 delet3d, but the number at the end yes “
This will remove the numericals at the end of the string.
You could try this
s = "This must not b3 delet3d, but the number at the end yes 134411"
re.sub("(sd+)","",s)
result:
'This must not b3 delet3d, but the number at the end yes'
the same rule also applies to
s = "This must not b3 delet3d, 4566 but the number at the end yes 134411"
re.sub("(sd+)","",s)
result:
'This must not b3 delet3d, but the number at the end yes'
To match only pure integers in a string:
b(?<![0-9-])(d+)(?![0-9-])b
It does the right thing with this, matching only everything after million:
max-3 cvd-19 agent-007 8-zoo 2ab c3d ef4 55g h66i jk77
8m9n o0p2 million 0 22 333 4444
All of the other 8 regex answers on this page fail in various ways with that input.
The dash at the end by that first 0-9 … [0-9-] … preserves -007 and the dash in the second set preserves 8-.
Or d in place of 0-9 if you prefer
Can it be simplified?
I had a light-bulb moment, I tried and it works:
sol = re.sub(r'[~^0-9]', '', 'aas30dsa20')
output:
aasdsa