Python: Detect number separator symbols and parse into a float without locale
Question:
I have a dataset with millions of text files with numbers saved as strings and using a variety of locales to format the number. What I am trying to do is guess which symbol is the decimal separator and which is the thousand separator.
This shouldn’t be too hard but it seems the question hasn’t been asked yet and for posterity it should be asked and answered here.
What I do know is that there is always a decimal separator and it is always the last non-[0-9] symbol in the string.
As you can see below, a simple numStr.replace(',', '.') to fix the variations in decimal separators will conflict with the possible thousands separators.
I have seen ways of doing it if you know the locale but I do NOT know the locale in this instance.
Dataset:
1.0000 //1.0
1,0000 //1.0
10,000.0000 //10000.0
10.000,0000 //10000.0
1,000,000.0000 // 1000000.0
1.000.000,0000 // 1000000.0
//also possible
1 000 000.0000 //1000000.0 with spaces as thousand separators
Answers:
One approach:
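import re

with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        # Everything that is not a digit is a separator candidate
        separators = re.sub('[0-9]', '', line)
        # Drop every separator except the last one
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        # The last separator is the decimal separator
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)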
On your sample input (comments removed), the output is:
1.0000
1.0000
10000.0000
10000.0000
1000000.0000
1000000.0000
1000000.0000
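For reuse outside a file loop, the same logic can be wrapped in a function that returns a float. The sketch below is only an illustration: the name parse_number is hypothetical, and the sign handling is an addition that the loop above does not have (there, a leading '-' would be treated as just another separator and dropped).

```python
import re


def parse_number(text):
    """Parse a number whose last non-digit symbol is the decimal separator.

    Hypothetical helper; assumes a decimal separator is always present,
    as stated in the question.
    """
    text = text.strip()
    # Peel off an optional leading sign so it is not mistaken for a separator.
    sign = -1.0 if text.startswith('-') else 1.0
    text = text.lstrip('+-')
    # Everything that is not a digit is a separator candidate.
    separators = re.sub(r'\d', '', text)
    # Drop every separator except the last one.
    for sep in separators[:-1]:
        text = text.replace(sep, '')
    # The last separator becomes the decimal point.
    if separators:
        text = text.replace(separators[-1], '.')
    return sign * float(text)


print(parse_number('1.000.000,0000'))
print(parse_number('-10.000,5'))
```

Strings without a decimal separator, such as '1,000,000', still come out right because removing the earlier separators also removes the last one, but a value like '1,000' would be read as 1.0.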
Update: Handling Unicode
As NeoZenith points out in the comments, with modern Unicode text, the venerable regular expression [0-9] is not reliable. Use the following instead:
import re

with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        # \d with re.U matches any Unicode decimal digit
        separators = re.sub(r'\d', '', line, flags=re.U)
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)
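Without the re.U flag, \d is equivalent to [0-9]. (This holds for Python 2. In Python 3, \d in a str pattern already matches Unicode decimal digits by default, and the re.ASCII flag restricts it to [0-9].) With that flag, \d matches whatever is classified as a decimal digit in the Unicode character properties database. Alternatively, for handling unusual digit characters, one may want to consider using unicode.translate (str.translate in Python 3).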
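As a rough illustration of the digit-normalization idea, the sketch below maps every Unicode decimal digit to its ASCII counterpart before any separator detection takes place. It uses unicodedata.decimal with a generator rather than a translate table, and the function name to_ascii_digits is hypothetical.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII equivalent."""
    return ''.join(
        # str.isdecimal() is True only for decimal digit characters,
        # for which unicodedata.decimal() returns the value 0-9.
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Arabic-Indic digits 1, 2, 3 and 4 around an ASCII comma.
print(to_ascii_digits('\u0661\u0662\u0663,\u0664'))  # 123,4
```

Separators are left untouched, so the result can be passed on to either of the parsing approaches in this thread.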
Another approach that also checks for wrong number formatting, notifies of possible wrong interpretation, and is faster than the current solution (performance reports below):
import re

pattern_comma_thousands_dot_decimal = re.compile(r'^[-+]?((\d{1,3}(,\d{3})*)|(\d*))(\.|\.\d*)?$')
pattern_dot_thousands_comma_decimal = re.compile(r'^[-+]?((\d{1,3}(\.\d{3})*)|(\d*))(,|,\d*)?$')
pattern_confusion_dot_thousands = re.compile(r'^(?:[-+]?(?=.*\d)(?=.*[1-9]).{1,3}\.\d{3})$')  # for numbers like '100.000' (is it 100.0 or 100000?)
pattern_confusion_comma_thousands = re.compile(r'^(?:[-+]?(?=.*\d)(?=.*[1-9]).{1,3},\d{3})$')  # for numbers like '100,000' (is it 100.0 or 100000?)


def parse_number_with_guess_for_separator_chars(number_str: str, max_val=None):
    """
    Tries to guess the thousands and decimal characters (comma or dot) and converts the string number accordingly.
    The return also indicates if the correctness of the result is certain or uncertain
    :param number_str: a string with the number to convert
    :param max_val: an optional parameter determining the allowed maximum value.
                    This helps prevent mistaking the decimal separator as a thousands separator.
                    For instance, if max_val is 101 then the string '100.000' which would be
                    interpreted as 100000.0 will instead be interpreted as 100.0
    :return: a tuple with the number as a float and a flag (`True` if certain and `False` if uncertain)
    """
    number_str = number_str.strip().lstrip('0')
    certain = True
    if pattern_confusion_dot_thousands.match(number_str) is not None:
        number_str = number_str.replace('.', '')  # assume dot is thousands separator
        certain = False
    elif pattern_confusion_comma_thousands.match(number_str) is not None:
        number_str = number_str.replace(',', '')  # assume comma is thousands separator
        certain = False
    elif pattern_comma_thousands_dot_decimal.match(number_str) is not None:
        number_str = number_str.replace(',', '')
    elif pattern_dot_thousands_comma_decimal.match(number_str) is not None:
        number_str = number_str.replace('.', '').replace(',', '.')
    else:
        raise ValueError()  # For stuff like '10,000.000,0' and other nonsense

    number = float(number_str)
    if not certain and max_val is not None and number > max_val:
        number *= 0.001  # Change previous assumption to decimal separator, so '100.000' goes from 100000.0 to 100.0
        certain = True  # Since this uniquely satisfies the given constraint, it should be a certainly correct interpretation
    return number, certain
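To make the "uncertain" case concrete: a string with one separator followed by exactly three digits has two plausible readings, and a bound such as max_val is one way to choose between them. The following is a simplified sketch of that idea, not the answer's own code; the pattern AMBIGUOUS and the function ambiguous_readings are hypothetical names.

```python
import re

# One to three digits, a single dot or comma, then exactly three digits.
AMBIGUOUS = re.compile(r'^[-+]?\d{1,3}([.,])\d{3}$')


def ambiguous_readings(number_str):
    """Return [thousands_reading, decimal_reading], or [] if not ambiguous."""
    number_str = number_str.strip()
    match = AMBIGUOUS.match(number_str)
    if match is None:
        return []
    sep = match.group(1)
    as_thousands = float(number_str.replace(sep, ''))
    as_decimal = float(number_str.replace(sep, '.'))
    return [as_thousands, as_decimal]


readings = ambiguous_readings('100.000')
print(readings)  # [100000.0, 100.0]

# With an upper bound of 101, only the decimal reading remains plausible.
print([value for value in readings if value <= 101])  # [100.0]
```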
Performance in worst case:
python -m timeit "parse_number_with_guess_for_separator_chars('10,043,353.23')"
100000 loops, best of 5: 2.01 usec per loop
python -m timeit "John1024_solution('10.089.434,54')"
100000 loops, best of 5: 3.04 usec per loop
Performance in best case:
python -m timeit "parse_number_with_guess_for_separator_chars('10.089')"
500000 loops, best of 5: 946 nsec per loop
python -m timeit "John1024_solution('10.089')"
100000 loops, best of 5: 1.97 usec per loop
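The commands as quoted presumably relied on a setup step (for example timeit's -s option) to make the functions importable, since a bare statement would otherwise fail with a NameError. A comparable measurement can be sketched from within Python as follows. The function last_separator_parse is a hypothetical stand-in for whichever parser is being timed, and the numbers it produces will differ from machine to machine.

```python
import re
import timeit


def last_separator_parse(text):
    """Hypothetical stand-in for the function under test."""
    separators = re.sub(r'\d', '', text)
    for sep in separators[:-1]:
        text = text.replace(sep, '')
    if separators:
        text = text.replace(separators[-1], '.')
    return float(text)


LOOPS = 10_000

# globals= makes the function visible to the timed statement.
runs = timeit.repeat(
    "last_separator_parse('10.089.434,54')",
    globals={'last_separator_parse': last_separator_parse},
    number=LOOPS,
    repeat=5,
)

# Report the fastest of the five runs, converted to microseconds per call.
best_usec = min(runs) / LOOPS * 1e6
print(f'best of 5: {best_usec:.2f} usec per loop')
```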
John’s idea inspired me to develop it further. I extended it with automatic recognition of unit abbreviations, which can be edited in the md dictionary: the key is the unit abbreviation and the value is the multiplier. This makes the function adaptable to many kinds of input. The result is always a number you can calculate with. Set the parameter toInt=True and the result is an integer. It may not be the fastest method, but it has given me reliable results on the inputs below.
import re

md = {'gr': 0.001, '%': 0.01, 'K': 1000, 'M': 1000000, 'B': 1000000000, 'ms': 0.001, 'mt': 1000}
kl = list(md.keys())


def str_to_float_or_Int(strVal, toInt=None):
    toInt = False if toInt is None else toInt

    def chck_char_in_string(strVal):
        # Return the first unit abbreviation from kl found in the string, or None
        rs = None
        for el in kl:
            if el in strVal:
                rs = el
                break
        return rs

    strVal = strVal.strip()
    mpk = chck_char_in_string(strVal)
    mp = 1 if mpk is None else md[mpk]
    # Keep only digits, dots, commas and minus signs
    strVal = re.sub(r'[^\d.,-]+', '', strVal)
    # Collect the separators in order of appearance
    seps = re.sub(r'-?\d', '', strVal, flags=re.U)
    for sep in seps[:-1]:
        strVal = strVal.replace(sep, '')
    if seps:
        strVal = strVal.replace(seps[-1], '.')
    dcnm = float(strVal)
    dcnm = dcnm * mp
    dcnm = int(round(dcnm)) if toInt else dcnm
    return dcnm
Call the function as follows:
Values = ['1,354852M', '+10.000,12 gr', '-45,145.01 K', '753,159.456', '-87,24%', '1,000,000', '10,2K', '985 ms', '(mt) 0,475', '888 745.23', ' ,159']
for val in Values:
    result = str_to_float_or_Int(val)
    print(result)
exit()
The output results:
1354852.0
10.00012
-45145010.0
753159.456
-0.8724
1000000.0
10200.0
0.985
475.0
888745.23
0.159
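One thing to watch when editing the unit dictionary: the lookup returns the first key that occurs anywhere in the string, so with overlapping abbreviations the result depends on dictionary order. A possible variation, sketched here with a hypothetical helper find_unit and a made-up units table, is to prefer the longest matching key.

```python
def find_unit(text, units):
    """Return the longest key of units that occurs in text, or None."""
    matches = [key for key in units if key in text]
    # Longer abbreviations win, so 'ms' is preferred over 'm'.
    return max(matches, key=len, default=None)


# Hypothetical table with overlapping keys, for illustration only.
units = {'m': 1, 'ms': 0.001, 'K': 1000}

print(find_unit('985 ms', units))  # ms
print(find_unit('12,5', units))    # None
```

This is still a substring match, so it does not check where in the string the abbreviation appears.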