Removing all non-numeric characters from string in Python

Question:

How do we remove all non-numeric characters from a string in Python?

Asked By: grizzley


Answers:

Not sure if this is the most efficient way, but:

>>> ''.join(c for c in "abc123def456" if c.isdigit())
'123456'

The ''.join part means to combine all the resulting characters together without any characters in between. Then the rest of it is a generator expression, where (as you can probably guess) we only take the parts of the string that match the condition isdigit.
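
One caveat worth knowing: in Python 3, str.isdigit() is also true for some non-ASCII characters, such as superscript digits. If only the ASCII digits 0-9 should survive, a stricter membership test can be used instead. A minimal sketch, where the helper name ascii_digits_only is made up for illustration:

```python
def ascii_digits_only(text):
    """Keep only the characters '0' through '9' (hypothetical helper)."""
    return ''.join(c for c in text if c in '0123456789')

sample = 'abc123²def456'

# str.isdigit() accepts the superscript two, so it is kept here.
unicode_aware = ''.join(c for c in sample if c.isdigit())

# The explicit membership test drops it.
ascii_only = ascii_digits_only(sample)
```

Which behaviour is right depends on whether the result is later passed to something like int(), which rejects superscript digits.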

Answered By: Mark Rushakoff
>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
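
If the same substitution is applied many times, the pattern can be compiled once and reused. The sketch below uses the \D shorthand; for str patterns in Python 3, \D is Unicode-aware unless the re.ASCII flag is given, so the flag is included here on the assumption that only 0-9 should be kept. The names NON_DIGITS and strip_non_digits are illustrative.

```python
import re

# Compiled once; re.ASCII makes \D mean "anything except 0-9".
NON_DIGITS = re.compile(r'\D', re.ASCII)

def strip_non_digits(text):
    """Remove every character that is not an ASCII digit."""
    return NON_DIGITS.sub('', text)

result = strip_non_digits('sdkjh987978asd098as0980a98sd')
```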
Answered By: Ned Batchelder

Fastest approach, if you need to perform more than just one or two such removal operations (or even just one, but on a very long string!-), is to rely on the translate method of strings, even though it does need some prep:

>>> import string
>>> allchars = ''.join(chr(i) for i in xrange(256))
>>> identity = string.maketrans('', '')
>>> nondigits = allchars.translate(identity, string.digits)
>>> s = 'abc123def456'
>>> s.translate(identity, nondigits)
'123456'
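
The session above is Python 2 (xrange, string.maketrans, and the two-argument str.translate). A rough Python 3 counterpart for byte strings, as a sketch: bytes.translate accepts None as the table when characters should only be deleted. The variable names are illustrative.

```python
import string

digit_bytes = string.digits.encode('ascii')

# Every possible byte value with the digits removed: the bytes to delete.
non_digit_bytes = bytes(range(256)).translate(None, digit_bytes)

data = b'abc123def456'
# table=None means no remapping; the second argument lists bytes to drop.
digits_only = data.translate(None, non_digit_bytes)
```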

The translate method is different, and maybe a tad simpler to use, on Unicode strings than it is on byte strings, btw:

>>> unondig = dict.fromkeys(xrange(65536))
>>> for x in string.digits: del unondig[ord(x)]
... 
>>> s = u'abc123def456'
>>> s.translate(unondig)
u'123456'

You might want to use a mapping class rather than an actual dict, especially if your Unicode string may potentially contain characters with very high ord values (that would make the dict excessively large;-). For example:

>>> class keeponly(object):
...   def __init__(self, keep): 
...     self.keep = set(ord(c) for c in keep)
...   def __getitem__(self, key):
...     if key in self.keep:
...       return key
...     return None
... 
>>> s.translate(keeponly(string.digits))
u'123456'
>>> 
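
For Python 3, where every str is Unicode, the same mapping-class idea might look like the sketch below. It assumes only that str.translate looks up each code point with __getitem__ and deletes the character when the result is None; the class name KeepOnly is a restyled version of the one above.

```python
import string

class KeepOnly:
    """Translation table that keeps the given characters and deletes the rest."""

    def __init__(self, keep):
        self.keep = {ord(c) for c in keep}

    def __getitem__(self, codepoint):
        # Returning the code point keeps the character; None deletes it.
        return codepoint if codepoint in self.keep else None

digits_table = KeepOnly(string.digits)
kept = 'abc123def456'.translate(digits_table)
```

Because nothing is stored per deleted character, this also handles code points above 65535 without a large dict.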
Answered By: Alex Martelli

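This should work for both strings and unicode objects in Python 2, and for strings in Python 3. (Iterating over a bytes object in Python 3 yields integers, so the second version raises a TypeError on bytes rather than filtering them.)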

# python <3.0
def only_numerics(seq):
    return filter(type(seq).isdigit, seq)

# python ≥3.0
def only_numerics(seq):
    seq_type= type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
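
To cover bytes in Python 3 as well, one option is to branch on the type and compare the integer values directly. This is a sketch under that assumption, not part of the original answer; the name only_numerics_py3 is hypothetical.

```python
def only_numerics_py3(seq):
    """Keep digits in a str, bytes, or bytearray (Python 3 only)."""
    if isinstance(seq, (bytes, bytearray)):
        # Iterating over bytes yields ints; 48..57 are the ASCII codes of '0'..'9'.
        return type(seq)(b for b in seq if 48 <= b <= 57)
    return ''.join(filter(str.isdigit, seq))
```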
Answered By: tzot

Just to add another option to the mix, there are several useful constants within the string module. While more useful in other cases, they can be used here.

>>> from string import digits
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

There are several constants in the module, including:

  • ascii_letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
  • hexdigits (0123456789abcdefABCDEF)

If you are using these constants heavily, it can be worthwhile to convert them to a frozenset. That enables O(1) lookups, rather than O(n), where n is the length of the original string constant.

>>> digits = frozenset(digits)
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'
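
The same pattern generalizes to the other constants. As an illustration (the helper name keep_chars and the sample input are made up), this keeps only hexadecimal characters:

```python
from string import hexdigits

# Built once so each membership test is a set lookup.
HEX_CHARS = frozenset(hexdigits)

def keep_chars(text, allowed):
    """Return text with every character outside `allowed` removed."""
    return ''.join(c for c in text if c in allowed)

mac = keep_chars('00:1A:2b:3C', HEX_CHARS)
```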
Answered By: Tim McNamara

@Ned Batchelder and @newacct provided the right answer, but …

Just in case your string contains thousands separators (,) and a decimal point (.):

import re
re.sub(r"[^\d.]", "", "$1,999,888.77")
'1999888.77'
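
If the cleaned string is meant to be used as a number, it could be handed to decimal.Decimal, which avoids binary floating-point rounding for currency values. A minimal sketch; parse_amount is a hypothetical helper, and input with more than one decimal point would raise decimal.InvalidOperation.

```python
import re
from decimal import Decimal

def parse_amount(text):
    """Strip everything except digits and '.', then parse as Decimal."""
    cleaned = re.sub(r'[^\d.]', '', text)
    return Decimal(cleaned)

amount = parse_amount('$1,999,888.77')
```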
Answered By: kennyut

There are many right answers, but in case you want the result directly as a float, without using regex:

x = '$123.45M'

float(''.join(c for c in x if (c.isdigit() or c == '.')))

123.45

You can swap the point for a comma, depending on your needs.

Use this instead if you know your number is an integer:

x = '$1123'
int(''.join(c for c in x if c.isdigit()))

1123
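
Note that float() itself only understands a point as the decimal separator, so when keeping a comma instead, the comma has to be converted before parsing. A sketch assuming the input contains at most one comma and no thousands separators; the function name and sample value are made up:

```python
def parse_decimal_comma(text):
    """Parse a number written with a decimal comma, e.g. '€123,45'."""
    kept = ''.join(c for c in text if c.isdigit() or c == ',')
    # float() expects a point, so swap the comma before converting.
    return float(kept.replace(',', '.'))

value = parse_decimal_comma('€123,45')
```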

Answered By: Alberto Ibarra

An easy way:

str.isdigit() returns True if str contains only numeric characters. Call filter(predicate, iterable) with str.isdigit as predicate and the string as iterable to return an iterable containing only the string’s numeric characters. Call str.join(iterable) with the empty string as str and the result of filter() as iterable to join each numeric character together into one string.

For example:

a_string = "!1a2;b3c?"
numeric_filter = filter(str.isdigit, a_string)
numeric_string = "".join(numeric_filter)
print(numeric_string)

And the output is:

123
Answered By: Ehsan Akbaritabar

There are a lot of correct answers here. Some are faster or slower than others. The approach used in Ehsan Akbaritabar’s and tzot’s answers, filter with str.isdigit, is really fast; as is translate, from Alex Martelli’s answer, once the setup is done. These are the two fastest methods. However, if you are only doing the substitution once, the setup penalty for translate is significant.

Which way is the best may depend on your use case. One replacement in a unit test? I’d go for filter using isdigit. It requires no imports, uses builtins only, and is quick and easy:

''.join(filter(str.isdigit, string_to_filter))

In a pandas or pyspark DataFrame, with millions of rows, the efficiency of translate is probably worth it, if you don’t use the methods the DataFrame provides (which tend to rely on regex).

If you want to use the translate approach, I’d recommend some changes for Python 3:

import string

unicode_non_digits = dict.fromkeys(
    [x for x in range(65536) if chr(x) not in string.digits]
)
string_to_filter.translate(unicode_non_digits)
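
One limitation of a table built from range(65536) is that code points above U+FFFF are not in it, so str.translate leaves those characters untouched. A possible workaround, sketched here and not part of the timings, is a dict subclass whose __missing__ returns None, so that every unlisted code point is deleted. The class name DigitTable is hypothetical.

```python
import string

class DigitTable(dict):
    """Maps digits to themselves; any other code point is deleted."""

    def __missing__(self, codepoint):
        return None  # None tells str.translate to drop the character

digit_table = DigitTable((ord(d), ord(d)) for d in string.digits)

# The table from the snippet above, for comparison.
bmp_only_table = dict.fromkeys(
    x for x in range(65536) if chr(x) not in string.digits
)

sample = 'a1' + '\U0001F600' + '2'            # contains a code point above U+FFFF
with_bmp_table = sample.translate(bmp_only_table)  # the emoji survives
with_digit_table = sample.translate(digit_table)   # only digits survive
```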

Method                                                Loops  Repeats  Best time per loop
filter using isdigit                                   1000       15  0.83 usec
generator using isdigit                                1000       15  1.6 usec
using re.sub                                           1000       15  1.94 usec
generator testing membership in digits                 1000       15  1.23 usec
generator testing membership in digits set             1000       15  1.19 usec
use translate                                          1000       15  0.797 usec
use re.compile                                         1000       15  1.52 usec
use translate but make translation table every time      20        5  1.21e+04 usec

That last row in the table shows the setup penalty for translate. When creating the translation table every time, I used the default number and repeat options; otherwise it takes too long.

The raw output from my timing script:

/bin/zsh /Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:6> which python
/Users/henry.longmore/.pyenv/shims/python
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:7> python --version
Python 3.10.6
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:8> set +x
-----filter using isdigit
1000 loops, best of 15: 0.83 usec per loop
-----generator using isdigit
1000 loops, best of 15: 1.6 usec per loop
-----using re.sub
1000 loops, best of 15: 1.94 usec per loop
-----generator testing membership in digits
1000 loops, best of 15: 1.23 usec per loop
-----generator testing membership in digits set
1000 loops, best of 15: 1.19 usec per loop
-----use translate
1000 loops, best of 15: 0.797 usec per loop
-----use re.compile
1000 loops, best of 15: 1.52 usec per loop
-----use translate but make translation table every time
     using default number and repeat, otherwise this takes too long
20 loops, best of 5: 1.21e+04 usec per loop

The script I used for the timings:

NUMBER=1000
REPEAT=15
UNIT="usec"
TEST_STRING="abc123def45ghi6789"
set -x
which python
python --version
set +x
echo "-----filter using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(filter(str.isdigit, '${TEST_STRING}'))"
echo "-----generator using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(c for c in '${TEST_STRING}' if c.isdigit())"
echo "-----using re.sub"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re" "re.sub('[^0-9]', '', '${TEST_STRING}')"
echo "-----generator testing membership in digits"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----generator testing membership in digits set"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits; digits = {*digits}" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----use translate"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import string; unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits])" "'${TEST_STRING}'.translate(unicode_non_digits)"
echo "-----use re.compile"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re; digit_filter = re.compile('[^0-9]')" "digit_filter.sub('', '${TEST_STRING}')"
echo "-----use translate but make translation table every time"
echo "     using default number and repeat, otherwise this takes too long"
python -m timeit --unit=$UNIT --setup="import string" "unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits]); '${TEST_STRING}'.translate(unicode_non_digits)"
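
For anyone who prefers to stay inside Python, a similar measurement could be sketched with the timeit module directly. This is only an approximation of the shell script above, covering two of the methods; the helper best_usec is hypothetical, and the numbers will differ by machine and Python version.

```python
import timeit

TEST_STRING = 'abc123def45ghi6789'

def best_usec(stmt, setup='pass', number=1000, repeat=15):
    """Best per-loop time in microseconds over `repeat` runs of `number` loops."""
    timings = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(timings) / number * 1e6

filter_usec = best_usec(f"''.join(filter(str.isdigit, {TEST_STRING!r}))")
regex_usec = best_usec(
    f"digit_filter.sub('', {TEST_STRING!r})",
    setup="import re; digit_filter = re.compile('[^0-9]')",
)

print(f'filter using isdigit: {filter_usec:.3g} usec per loop')
print(f'use re.compile:       {regex_usec:.3g} usec per loop')
```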


Answered By: hlongmore