Strip all non-numeric characters (except for ".") from a string in Python
Question:
I’ve got a pretty good working snippet of code, but I was wondering if anyone has any better suggestions on how to do this:
val = ''.join([c for c in val if c in '1234567890.'])
What would you do?
Answers:
You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that’s not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag, the resulting string could still include non-ASCII digits. Also, the result after removing “non-numeric” characters is not necessarily a valid number.
>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>> non_decimal.sub('', '12.34fe4e')
'12.344'
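To illustrate both caveats, here is a minimal sketch. It assumes Python 3, where \d in a str pattern already matches Unicode digits unless re.ASCII is passed; the sample strings and variable names are made up for the example.

```python
import re

# Python 3: \d in a str pattern matches any Unicode decimal digit by default.
unicode_pattern = re.compile(r'[^\d.]+')
# re.ASCII restricts \d to the characters 0-9.
ascii_pattern = re.compile(r'[^\d.]+', re.ASCII)

sample = 'a1\u0662.3b'  # \u0662 is ARABIC-INDIC DIGIT TWO

kept_unicode = unicode_pattern.sub('', sample)  # the non-ASCII digit survives
kept_ascii = ascii_pattern.sub('', sample)      # only 0-9 and '.' survive

# Stripping does not guarantee a parseable number: two periods remain here.
stripped = ascii_pattern.sub('', 'v1.2.3')
try:
    float(stripped)
    is_valid = True
except ValueError:
    is_valid = False
```

If the cleaned string is going to be passed to float(), it is probably worth wrapping that call in a try/except like the one above.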
Here’s some sample code:
$ cat a.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c in '1234567890.'])
$ cat b.py
import re
non_decimal = re.compile(r'[^\d.]+')
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    non_decimal.sub('', a)
$ cat c.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c.isdigit() or c == '.'])
$ cat d.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    b = []
    for c in a:
        # skip anything that is not a digit or a period
        if not (c.isdigit() or c == '.'): continue
        b.append(c)
    ''.join(b)
And the timing results:
$ time python a.py
real 0m24.735s
user 0m21.049s
sys 0m0.456s
$ time python b.py
real 0m10.775s
user 0m9.817s
sys 0m0.236s
$ time python c.py
real 0m38.255s
user 0m32.718s
sys 0m0.724s
$ time python d.py
real 0m46.040s
user 0m41.515s
sys 0m0.832s
Looks like the regex is the winner so far.
Personally, I find the regex just as readable as the list comprehension. If you’re only doing this a few times, the cost of compiling the regex will probably outweigh the savings. Do what jibes with your code and coding style.
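The scripts above target Python 2 (xrange) and are timed from the shell, which also counts interpreter start-up. As a hedged alternative, here is a sketch of how the same comparison might be run on Python 3 with the standard timeit module. The function names and the iteration count are choices made for this example, and the numbers will differ from those quoted above.

```python
import re
import timeit

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
non_decimal = re.compile(r'[^\d.]+')  # compiled once, outside the timed code


def list_comp():
    # a.py: membership test against a literal string
    return ''.join([c for c in a if c in '1234567890.'])


def regex_sub():
    # b.py: precompiled regular expression
    return non_decimal.sub('', a)


def isdigit_comp():
    # c.py: str.isdigit() plus an explicit check for the period
    return ''.join([c for c in a if c.isdigit() or c == '.'])


candidates = [list_comp, regex_sub, isdigit_comp]

if __name__ == '__main__':
    for func in candidates:
        seconds = timeit.timeit(func, number=100000)
        print('%-14s %.3fs' % (func.__name__, seconds))
```

Because the pattern is compiled before timing starts, this measures only the substitution itself, not the one-off compile cost mentioned above.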
Another ‘Pythonic’ approach (on Python 2, where filter() returns a str when given a str):
filter( lambda x: x in '0123456789.', s )
but the regex is faster.
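On Python 3, filter() returns an iterator rather than a string, so the result has to be joined back together. A minimal sketch, with the sample input and the names allowed and result chosen only for illustration:

```python
s = '12.34fe4e'
allowed = set('0123456789.')

# filter() yields the characters that pass the test; join them into a str.
result = ''.join(filter(lambda ch: ch in allowed, s))
print(result)
```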
If the set of characters were larger, using sets as below might be faster. As it is, this is a bit slower than a.py.
dec = set('1234567890.')
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(ch for ch in a if ch in dec)
At least on my system, you can save a tiny bit of time (and memory if your string were long enough to matter) by using a generator expression instead of a list comprehension in a.py:
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(c for c in a if c in '1234567890.')
Oh, and here’s the fastest way I’ve found by far on this test string (much faster than regex) if you are doing this many, many times and are willing to put up with the overhead of building a couple of character tables.
chrs = ''.join(chr(i) for i in xrange(256))
deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    a.translate(chrs, deletable)
On my system, that runs in ~1.0 seconds where the regex b.py runs in ~4.3 seconds.
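The two-argument str.translate call above exists only on Python 2. A rough Python 3 equivalent, sketched under the assumption that the input contains only characters in the range 0-255 (anything above that range would pass through untouched), builds a deletion table with str.maketrans. The names keep, table, and result are illustrative.

```python
keep = '1234567890.'

# Every character in the 0-255 range that we do NOT want to keep.
deletable = ''.join(chr(i) for i in range(256) if chr(i) not in keep)

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', deletable)

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
result = a.translate(table)
print(result)
```

The table is built once and can be reused for every string, which is where the savings would come from when the operation is repeated many times.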
import string
filter(lambda c: c in string.digits + '.', s)
A simple solution is to use regular expressions:
import re
re.sub(r"[^0-9.]", "", data)
John’s idea inspired me to take it further. I extended it with automatic recognition of unit abbreviations, which are defined in the md dictionary: each key is a unit abbreviation and each value is its multiplier. The result is always a number you can calculate with. Set the parameter toInt=True to get an integer instead of a float. It may not be the fastest method, but it has given me reliable results on the inputs below.
import re
md = {'gr': 0.001, '%': 0.01, 'K': 1000, 'M': 1000000, 'B': 1000000000, 'ms': 0.001, 'mt': 1000}
kl = list(md.keys())
def str_to_float_or_Int(strVal, toInt=None):
    toInt = False if toInt is None else toInt
    def chck_char_in_string(strVal):
        # return the first unit abbreviation found in the string, if any
        rs = None
        for el in kl:
            if el in strVal:
                rs = el
                break
        return rs
    strVal = strVal.strip()
    mpk = chck_char_in_string(strVal)
    mp = 1 if mpk is None else md[mpk]
    # keep only digits, separators and the minus sign
    strVal = re.sub(r'[^\d.,-]+', '', strVal)
    # what remains after dropping the digits are the separators, in order
    seps = re.sub(r'-?\d', '', strVal, flags=re.U)
    # every separator except the last is a thousands separator
    for sep in seps[:-1]:
        strVal = strVal.replace(sep, '')
    # the last separator is treated as the decimal point
    if seps:
        strVal = strVal.replace(seps[-1], '.')
    dcnm = float(strVal)
    dcnm = dcnm * mp
    dcnm = int(round(dcnm)) if toInt else dcnm
    return dcnm
Call the function as follows:
Values = ['1,354852M', '+10.000,12 gr', '-45,145.01 K', '753,159.456', '-87,24%', '1,000,000', '10,2K', '985 ms', '(mt) 0,475', '888 745.23', ' ,159']
for val in Values:
    result = str_to_float_or_Int(val)
    print(result)
exit()
The output results:
1354852.0
10.00012
-45145010.0
753159.456
-0.8724
1000000.0
10200.0
0.985
475.0
888745.23
0.159
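The separator handling is the least obvious part of that function, so here it is isolated as a small sketch. The helper name normalize_separators is hypothetical and not part of the answer; it repeats only the step that treats the last separator as the decimal point.

```python
import re


def normalize_separators(text):
    """Hypothetical helper: keep digits, '-', '.' and ',', then treat the
    last separator as the decimal point and drop the earlier ones."""
    cleaned = re.sub(r'[^\d.,-]+', '', text)
    # Whatever is left after removing the digits is the separator sequence.
    seps = re.sub(r'-?\d', '', cleaned)
    for sep in seps[:-1]:
        cleaned = cleaned.replace(sep, '')
    if seps:
        cleaned = cleaned.replace(seps[-1], '.')
    return cleaned


european = normalize_separators('10.000,12')   # mixed separators
grouped = normalize_separators('1,000,000')    # repeated separator
ambiguous = normalize_separators('1,000')      # single separator
```

One consequence of this heuristic is that a lone thousands separator is read as a decimal point, so '1,000' becomes '1.000'. Inputs like that may need an explicit locale or format hint rather than guessing.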