Separate number from unit in a string in Python
Question:
I have strings containing numbers with their units, e.g. 2GB, 17ft, etc.
I would like to separate the number from the unit and create 2 different strings. Sometimes, there is a whitespace between them (e.g. 2 GB) and it’s easy to do it using split(' ').
When they are together (e.g. 2GB), I would test every character until I find a letter, instead of a number.
s = '17GB'
number = ''
unit = ''
for c in s:
    if c.isdigit():
        number += c
    else:
        unit += c
Is there a better way to do it?
Thanks
Answers:
You could use a regular expression to divide the string into groups:
>>> import re
>>> p = re.compile(r'(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')
How about using a regular expression?
You should use regular expressions, grouping together what you want to find out:
import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
    print("Number: %d, unit: %s" % (int(match.group(1)), match.group(2)))
Change the regex according to what you want to parse. If you’re unfamiliar with regular expressions, here’s a great tutorial site.
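As a rough illustration of adapting the regex, the sketch below builds the unit alternation from a list instead of hard-coding it. The names UNITS, UNIT_RE and split_known_unit are hypothetical, and unlike the pattern above it accepts leading zeros (\d+ rather than [1-9][0-9]*).

```python
import re

# Hypothetical list of accepted units; edit it to match your data.
UNITS = ["GB", "MB", "KB", "B"]

# re.escape guards against units that contain regex metacharacters.
UNIT_RE = re.compile(r"^(\d+)\s*(%s)$" % "|".join(re.escape(u) for u in UNITS))

def split_known_unit(text):
    """Return (number, unit), or None when the text does not match."""
    m = UNIT_RE.match(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```

Returning None for unknown units (e.g. '17ft') lets the caller decide whether that is an error.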
tokenize can help (Python 2 session):
>>> import StringIO
>>> import tokenize
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...     print token
...
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')
For this task, I would definitely use a regular expression:
import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
    number, unit = thematch.groups()
else:
    raise ValueError('String %r not in the expected format' % s)
In the RE pattern, \s means “whitespace”, \d means “digit”, \S means “non-whitespace”; * means “0 or more of the preceding”, + means “1 or more of the preceding”, and the parentheses enclose “capturing groups”, which are then returned by the groups() call on the match object. (thematch is None if the given string doesn’t correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters.)
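To make the description concrete, here is a small sketch of how that pattern behaves on a few inputs. The wrapper name split_number_unit is hypothetical and not part of the answer.

```python
import re

pattern = re.compile(r'\s*(\d+)\s*(\S+)')

def split_number_unit(s):
    """Return the two captured groups, or None if the pattern does not match."""
    m = pattern.match(s)
    return m.groups() if m else None

print(split_number_unit('  17 ft'))  # ('17', 'ft')
print(split_number_unit('GB'))       # None: no leading digits
print(split_number_unit('17'))       # ('1', '7'): see the note below
```

One caveat: because the unit group requires at least one non-whitespace character, a bare number such as '17' still matches by backtracking, with the last digit captured as the "unit". If unitless input is possible, changing \S+ to \S* avoids this.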
A regular expression.
import re
m = re.match(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)', s)
if m is None:
    raise ValueError("not a number with units")
number = m.group("n")
unit = m.group("u")
This will give you a number (integer or fixed point; too hard to disambiguate scientific notation’s “e” from a unit prefix) with an optional sign, followed by the units, with optional whitespace.
You can use re.compile() if you’re going to be doing a lot of matches.
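A minimal sketch of the re.compile() suggestion, assuming the number class is repeated with + so that multi-digit numbers match. NUMBER_UNIT and split_all are hypothetical names used only for illustration.

```python
import re

# Compile once, reuse for every string.
NUMBER_UNIT = re.compile(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)')

def split_all(strings):
    """Split each string into (number, unit); raise on the first bad one."""
    results = []
    for s in strings:
        m = NUMBER_UNIT.match(s)
        if m is None:
            raise ValueError("not a number with units: %r" % s)
        results.append((m.group("n"), m.group("u")))
    return results

print(split_all(["2GB", "-3.5 m", " 17ft"]))
# [('2', 'GB'), ('-3.5', 'm'), ('17', 'ft')]
```

The numbers are returned as strings, as in the answer; convert with int() or float() as needed.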
s = '17GB'
for i, c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no non-digit found: the whole string is the number
number = int(s[:i])
unit = s[i:]
You can break out of the loop when you find the first non-digit character:
for i, c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no non-digit found: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()
If you have negative numbers and decimals:
numeric = '0123456789-.'
for i, c in enumerate(s):
    if c not in numeric:
        break
else:
    i = len(s)  # no unit found: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()
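The same scan can also be written with itertools.takewhile, which is a different technique from the explicit loop above but sidesteps the index bookkeeping. This is only a sketch; NUMERIC and split_leading_number are hypothetical names.

```python
from itertools import takewhile

NUMERIC = '0123456789-.'

def split_leading_number(s):
    """Split s into its leading run of numeric characters and the rest."""
    # takewhile stops at the first character that is not in NUMERIC.
    prefix = ''.join(takewhile(lambda c: c in NUMERIC, s))
    return prefix, s[len(prefix):].lstrip()

print(split_leading_number('-1.5 kg'))  # ('-1.5', 'kg')
print(split_leading_number('17'))       # ('17', '')
```

Like the loop, this does not check that the prefix is a valid number (for example '1-2.' would be accepted), so validate with float() if that matters.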
>>> s="17GB"
>>> ind = list(map(str.isalpha, s)).index(True)
>>> num, suffix = s[:ind], s[ind:]
>>> print(num + ":" + suffix)
17:GB
This uses an approach which should be a bit more forgiving than regexes. Note: this is not as performant as the other solutions posted.
def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units(" 3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwiches")
    (0, 'spam sandwiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            # Move the last character into the unit and retry.
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()
SCIENTIFIC NOTATION
This regex is working well for me to parse numbers that may be in scientific notation. It is based on the Python documentation about simulating scanf():
https://docs.python.org/3/library/re.html#simulating-scanf
import re
units_pattern = re.compile(r"([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?|\s*[a-zA-Z]+\s*$)")
number_with_units = [match.group(0) for match in units_pattern.finditer("+2.0e-1 mm")]
print(number_with_units)
# ['+2.0e-1', ' mm']
n, u = number_with_units
print(float(n), u.strip())
# 0.2 mm
Try the regex pattern below. The first group (the scanf() tokens for a number, any which way) is lifted directly from the Python docs for the re module.
import re
SCANF_MEASUREMENT = re.compile(
    r'''(                      # group match like scanf() token %e, %E, %f, %g
    [-+]?                      # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)        # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?            # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                      # separator: white space or nothing
    (                          # unit of measure: like GB. also works for no units
    \S*)''', re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement
    **measurement** is the value of a quantity of something. most complicated example::
        -666.6e-100 units
'''
def parse_measurement(value_sep_units):
    measurement = SCANF_MEASUREMENT.match(value_sep_units)
    if measurement is None:
        raise ValueError("doesn't start with a number: %r" % value_sep_units)
    value = float(measurement.group(1))  # group 1: the whole number
    units = measurement.group(6)         # group 6: the unit (group 5 is the separator)
    return value, units
This kind of parser is already integrated into Pint:
Pint is a Python package to define, operate and manipulate physical
quantities: the product of a numerical value and a unit of
measurement. It allows arithmetic operations between them and
conversions from and to different units.
You can install it with pip install pint.
Then, you can parse a string, get the desired value (‘magnitude’) and its unit:
>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> size = ureg('2GB')
>>> size.m
2
>>> size.u
<Unit('gigabyte')>
>>> size.to('GiB')
<Quantity(1.86264515, 'gibibyte')>
>>> length = ureg('17ft')
>>> length.m
17
>>> length.u
<Unit('foot')>
>>> length.to('cm')
<Quantity(518.16, 'centimeter')>
Unfortunately, none of the previous solutions worked correctly in my situation, so I developed the following code. The idea behind it is that every number ends with a digit or a dot.
def splitValUnit(s):
    s = s.replace(' ', '')
    lastIndex = len(s) - 1
    i = lastIndex
    # Scan backwards for the last digit or dot: the number ends there.
    for i in range(lastIndex, -1, -1):
        if s[i].isdigit() or s[i] == '.':
            break
    i = i + 1
    value = 0
    unit = ''
    try:
        value = float(s[:i])
        unit = s[i:]
    except ValueError:
        pass
    return {'value': value, 'unit': unit}
print(splitValUnit('7')) #{'value': 7.0, 'unit': ''}
print(splitValUnit('+7')) #{'value': 7.0, 'unit': ''}
print(splitValUnit('7m')) #{'value': 7.0, 'unit': 'm'}
print(splitValUnit('27')) #{'value': 27.0, 'unit': ''}
print(splitValUnit('7.')) #{'value': 7.0, 'unit': ''}
print(splitValUnit('2GHz')) #{'value': 2.0, 'unit': 'GHz'}
print(splitValUnit('+2.e-10H')) #{'value': 2e-10, 'unit': 'H'}
print(splitValUnit('2.3e+4 MegaOhm'))#{'value': 23000.0, 'unit': 'MegaOhm'}
print(splitValUnit('-4.')) #{'value': -4.0, 'unit': ''}
print(splitValUnit('e mm')) #{'value': 0, 'unit': ''}
print(splitValUnit('')) #{'value': 0, 'unit': ''}