Separate number from unit in a string in Python

Question:

I have strings containing numbers with their units, e.g. 2GB, 17ft, etc.
I would like to separate the number from the unit and create two different strings. Sometimes there is whitespace between them (e.g. 2 GB), and then it’s easy to do using split(' ').

When they are together (e.g. 2GB), I would test every character until I find a letter, instead of a number.

s='17GB'
number=''
unit=''
for c in s:
    if c.isdigit():
        number+=c
    else:
        unit+=c

Is there a better way to do it?

Thanks

Asked By: duduklein

Answers:

You could use a regular expression to divide the string into groups:

>>> import re
>>> p = re.compile(r'(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')
Answered By: Jarret Hardie

How about using a regular expression?

http://python.org/doc/1.6/lib/module-regsub.html

Answered By: Ole Media

You should use regular expressions, grouping together what you want to find out:

import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
  print("Number: %d, unit: %s" % (int(match.group(1)), match.group(2)))

Change the regex according to what you want to parse. If you’re unfamiliar with regular expressions, here’s a great tutorial site.
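
As one way of adapting the regex, the list of accepted units can be kept in a plain Python list and the alternation built from it. This is only a sketch: the UNITS list and the parse_known_unit helper are hypothetical names chosen for illustration, not part of the answer above.

```python
import re

# Hypothetical list of accepted units; adjust it to your own data.
UNITS = ["GB", "MB", "KB", "B", "ft"]

# Longest units first so that "GB" is tried before "B"; re.escape guards
# against units that contain regex metacharacters.
_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)
PATTERN = re.compile(r"^(\d+)\s*(%s)$" % _alternatives)


def parse_known_unit(text):
    """Return (int, unit), or None if text is not '<digits><known unit>'."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_known_unit("17GB"))
print(parse_known_unit("17 ft"))
print(parse_known_unit("17 parsecs"))
```

Strings with a unit that is not in the list are rejected (None) rather than split, which may or may not be what you want.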

Answered By: AndiDog

tokenize can help (Python 2 session shown):

>>> import StringIO
>>> import tokenize
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...   print token
... 
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')

For this task, I would definitely use a regular expression:

import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
  number, unit = thematch.groups()
else:
  raise ValueError('String %r not in the expected format' % s)

In the RE pattern, \s means “whitespace”, \d means “digit”, and \S means “non-whitespace”; * means “0 or more of the preceding”, + means “1 or more of the preceding”, and the parentheses enclose “capturing groups”, which are then returned by the groups() call on the match object. (thematch is None if the given string doesn’t correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters.)
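
To see how the pieces of the pattern behave, here is a small sketch that runs it, with its backslashes written out, on a few inputs. The split_groups helper is a hypothetical name used only for this illustration. The last two inputs are edge cases worth knowing about: because the unit group needs at least one character, a bare number gives up its last digit to it, and a decimal point ends up in the unit.

```python
import re

pattern = re.compile(r'\s*(\d+)\s*(\S+)')


def split_groups(text):
    """Return the (number, unit) groups, or None when the pattern fails."""
    m = pattern.match(text)
    return m.groups() if m else None


for text in ['2GB', '  17 ft', 'GB', '17', '1.5GB']:
    print(repr(text), '->', split_groups(text))
```

If bare numbers or decimals can occur in your data, the pattern needs adjusting (for example, allowing an empty unit or a decimal point in the number).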

Answered By: Alex Martelli

A regular expression.

import re

m = re.match(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)', s)
if m is None:
  raise ValueError("not a number with units")
number = m.group("n")
unit = m.group("u")

This will give you a number (integer or fixed point; too hard to disambiguate scientific notation’s “e” from a unit prefix) with an optional sign, followed by the units, with optional whitespace.

You can use re.compile() if you’re going to be doing a lot of matches.
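
A minimal sketch of the re.compile() variant, assuming the number part is one or more digit or dot characters with an optional sign. NUMBER_UNIT and split_all are hypothetical names introduced for this example.

```python
import re

# Compiled once and reused for every string.
NUMBER_UNIT = re.compile(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)')


def split_all(strings):
    """Split each string into a (number_text, unit_text) pair."""
    results = []
    for s in strings:
        m = NUMBER_UNIT.match(s)
        if m is None:
            raise ValueError("not a number with units: %r" % s)
        results.append((m.group("n"), m.group("u")))
    return results


print(split_all(["2GB", "-1.5 m", " +3ft"]))
```

Note that the number is returned as text; the character class would also accept something like 1.2.3, so passing the result through float() is a reasonable extra check.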

Answered By: Mike DeSimone

s='17GB'
for i,c in enumerate(s):
    if not c.isdigit():
        break
else:
    i=len(s)  # no break: the whole string is the number
number=int(s[:i])
unit=s[i:]
Answered By: John La Rooy

You can break out of the loop when you find the first non-digit character

for i,c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no break: every character was a digit
number = s[:i]
unit = s[i:].lstrip()

If you have negative and decimals:

numeric = '0123456789-.'
for i,c in enumerate(s):
    if c not in numeric:
        break
else:
    i = len(s)  # no break: every character was numeric
number = s[:i]
unit = s[i:].lstrip()
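
The same character-set scan can also be written without an explicit loop by letting str.lstrip() consume the numeric prefix. This is a sketch of an alternative, not the code above; NUMERIC_CHARS and split_leading_numeric are hypothetical names.

```python
NUMERIC_CHARS = '0123456789-.'


def split_leading_numeric(s):
    """Split s into its leading run of numeric characters and the rest."""
    rest = s.lstrip(NUMERIC_CHARS)        # what remains after the numeric prefix
    number = s[:len(s) - len(rest)]       # the prefix that lstrip() removed
    return number, rest.lstrip()


print(split_leading_numeric('17GB'))
print(split_leading_numeric('-1.5 km'))
print(split_leading_numeric('42'))
```

Like the loop, this only checks characters, so a prefix such as 1-2. is accepted as a "number"; it also handles a string with no unit, where the rest is simply empty.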
Answered By: pwdyson

>>> s="17GB"
>>> ind=list(map(str.isalpha,s)).index(True)
>>> num,suffix=s[:ind],s[ind:]
>>> print(num+":"+suffix)
17:GB
Answered By: ghostdog74

This uses an approach which should be a bit more forgiving than regexes. Note: this is not as performant as the other solutions posted.

def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units("   3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwhiches")
    (0, 'spam sandwhiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()
Answered By: Logan Evans

SCIENTIFIC NOTATION
This regex works well for me for parsing numbers that may be in scientific notation. It is based on the section of the Python documentation about simulating scanf:
https://docs.python.org/3/library/re.html#simulating-scanf

import re

units_pattern = re.compile(r"([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?|\s*[a-zA-Z]+\s*$)")
number_with_units = [match.group(0) for match in units_pattern.finditer("+2.0e-1 mm")]
print(number_with_units)
# ['+2.0e-1', ' mm']

n, u = number_with_units
print(float(n), u.strip())
# 0.2 mm
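
As a variation on the same idea, the number and the unit can be captured by one anchored pattern with named groups, so that malformed input is rejected instead of silently producing a different number of pieces. This is a sketch under assumptions: QUANTITY and parse_quantity are hypothetical names, and the unit is assumed to consist of ASCII letters only.

```python
import re

# scanf-style float (%e, %E, %f, %g) followed by an optional alphabetic unit.
QUANTITY = re.compile(
    r'\s*(?P<number>[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?)'
    r'\s*(?P<unit>[a-zA-Z]*)\s*'
)


def parse_quantity(text):
    """Return (float, unit); raise ValueError if text does not fit."""
    m = QUANTITY.fullmatch(text)
    if m is None:
        raise ValueError("not a number with units: %r" % text)
    return float(m.group('number')), m.group('unit')


print(parse_quantity("+2.0e-1 mm"))
print(parse_quantity("2eV"))
print(parse_quantity("1e3"))
```

An "e" is treated as an exponent only when digits follow it, so 2eV is read as 2 with unit eV, while something like 2e3m is read as 2000 with unit m. That ambiguity cannot be resolved by the pattern alone.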
Answered By: Vince W.

Try the regex pattern below. The first group (the scanf() tokens for a number, in any of its forms) is lifted directly from the Python docs for the re module.

import re
SCANF_MEASUREMENT = re.compile(
    r'''(                      # group match like scanf() token %e, %E, %f, %g
    [-+]?                      # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)        # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?            # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                      # separator: white space or nothing
    (                          # unit of measure: like GB. also works for no units
    \S*)''',    re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement

    **measurement** is the value of a quantity of something. most complicated example::

        -666.6e-100 units
'''

def parse_measurement(value_sep_units):
    measurement = SCANF_MEASUREMENT.match(value_sep_units)
    if measurement is None:
        raise ValueError("doesn't start with a number: %r" % value_sep_units)
    value = float(measurement.group(1))   # group 1 is the whole number
    units = measurement.group(6)          # group 6 is the unit of measure

    return value, units
Answered By: steodatus

This kind of parser is already integrated into Pint:

Pint is a Python package to define, operate and manipulate physical
quantities: the product of a numerical value and a unit of
measurement. It allows arithmetic operations between them and
conversions from and to different units.

You can install it with pip install pint.

Then, you can parse a string, get the desired value (‘magnitude’) and its unit:

>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> size = ureg('2GB')
>>> size.m
2
>>> size.u
<Unit('gigabyte')>
>>> size.to('GiB')
<Quantity(1.86264515, 'gibibyte')>
>>> length = ureg('17ft')
>>> length.m
17
>>> length.u
<Unit('foot')>
>>> length.to('cm')
<Quantity(518.16, 'centimeter')>
Answered By: Eric Duminil

Unfortunately, none of the previous answers worked correctly in my situation, so I wrote the following code. The idea behind it is that every number ends with a digit or a dot.

def splitValUnit(s):

    s = s.replace(' ', '')
    lastIndex = len(s) - 1
    i = lastIndex
    for i in range(lastIndex, -1, -1):
        if (s[i].isdigit() or s[i] == '.'):
            break
        
    i = i + 1

    value = 0
    unit = ''
    try:
        value = float(s[:i])
        unit = s[i:]
    except ValueError:  # s[:i] is not a valid number: keep the defaults
        pass

    return {'value': value, 'unit': unit}

print(splitValUnit('7'))             #{'value': 7.0, 'unit': ''}
print(splitValUnit('+7'))            #{'value': 7.0, 'unit': ''}
print(splitValUnit('7m'))            #{'value': 7.0, 'unit': 'm'}
print(splitValUnit('27'))            #{'value': 27.0, 'unit': ''}
print(splitValUnit('7.'))            #{'value': 7.0, 'unit': ''}
print(splitValUnit('2GHz'))          #{'value': 2.0, 'unit': 'GHz'}
print(splitValUnit('+2.e-10H'))      #{'value': 2e-10, 'unit': 'H'}
print(splitValUnit('2.3e+4 MegaOhm'))#{'value': 23000.0, 'unit': 'MegaOhm'}
print(splitValUnit('-4.'))           #{'value': -4.0, 'unit': ''}
print(splitValUnit('e mm'))          #{'value': 0, 'unit': ''}
print(splitValUnit(''))              #{'value': 0, 'unit': ''}
Answered By: Farhad