Replace number words in a string using several re.sub() calls in a specific order, conditioned by these regexes

Question:

import re

#Example 1
input_str = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"


#Example 2
input_str = "sumaria 6 cuatrillones 789 billones 320 mil a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"

mil = 1000
2 mil = 2000
322 mil = 322000

1 millon = 1000000
2 millones = 2000000
1 billon = 1000000000000
25 billones = 25000000000000
1 trillon = 1000000000000000000
3 trillones = 3000000000000000000
1 cuatrillon = 1000000000000000000000000

mil = 1 digit followed by 3 digits

millon = 1 digit followed by 6 digits

billon = 1 digit followed by 6+6 digits

trillon = 1 digit followed by 6+6+6 digits

cuatrillon = 1 digit followed by 6+6+6+6 digits

The difference between consecutive names is always exactly 6 digits. Positions that are not given are filled with 0, since the decimal system is positional (the position of each significant digit matters).

When the name is singular, for example "millon", there is always a 1 in front of it: "1 millon", not "1 millones" (the plural adds "es"). If the quantity is greater than 1, the plural is used, for example "2 trillones" = 2000000000000000000 or "320 billones" = 320000000000000.

"mil" is an exception since it has no plural: two thousand is written "2 mil", never "2 miles".

The other exception is that one thousand is not written "1 mil"; I write only "mil", and it is understood to mean 1000.
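
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the table and the two exceptions; the names SCALE and phrase_value are hypothetical and do not come from the question.

```python
# Hypothetical helper names; values follow the long-scale table in the question.
NAMES = ["millon", "billon", "trillon", "cuatrillon"]

# "mil" is 10**3; every later name adds six more digits: 10**6, 10**12, ...
SCALE = {"mil": 10**3}
SCALE.update({name: 10 ** (6 * (i + 1)) for i, name in enumerate(NAMES)})


def phrase_value(phrase: str) -> int:
    """Convert 'mil', '2 mil', '1 millon' or '25 billones' to an integer."""
    parts = phrase.split()
    word = parts[-1]
    if word.endswith("es"):          # plural form: drop the "es"
        word = word[:-2]
    # A bare "mil" (no number in front) stands for 1 mil.
    count = int(parts[0]) if len(parts) == 2 else 1
    return count * SCALE[word]


print(phrase_value("mil"))          # 1000
print(phrase_value("322 mil"))      # 322000
print(phrase_value("25 billones"))  # 25000000000000
```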

Proto regex for xxx mil xxx

r"\d{3}[\s|]*(?:mil)[\s|]*\d{3}"

Proto regex for millon, billon, trillon and cuatrillon

r"\d{6}[\s|]*(?:cuatrillones|cuatrillon)[\s|]*\d{6}[\s|]*(?:trillones|trillon)[\s|]*\d{6}[\s|]*(?:billones|billon)[\s|:]*\d{6}[\s|:]*(?:millones|millon)[\s|:]*\d{6}"

This is the output that I need to obtain with some replacement method such as re.sub(). Some of the regexes above should be used with it, because the replacement must only be made when the words sit inside such a sequence of numbers; otherwise it should not be made (as seen in the output of example 2, where "varios millones o trillones" stays unchanged).

"3000000000000320459 47475822"   #example 1

"sumaria 6000000000789000000320000 a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"   #example 2

How could I improve my regex to be able to perform these replacements correctly? Or maybe it is better to use another method?

Answers:

Going both ways:

import re

NUMBERS = [
    (10**15, 'quatrillon', 'es', False),
    (10**12, 'trillon', 'es', False),
    (10**9, 'billon', 'es', False),
    (10**6, 'millon', 'es', False),
    (10**3, 'mil', '', True)
]


def num_to_name(n):
    n = int(n) if isinstance(n, str) else n

    for size, name, multi, alone in NUMBERS:
        if n > size - 1:
            n = n // size
            if n == 1 and alone:
                return f'{name}'
            else:
                return f'{n} {name}{multi if n > 1 else ""}'
    return str(n)


def name_to_num(s, return_f=False):
    s = s[:-2] if s.endswith('es') else s
    for size, name, _, alone in NUMBERS:
        if s.lower().endswith(name):
            result = int(s[:-(len(name) + 1)]) * size if not alone or s.lower() != name else size
            return (result, size) if return_f else result
    return (int(s), 0) if return_f else int(s)


input_str = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822 1000"
num_str = re.sub(r'\d+(?: (?:quatr|tr|b|m)illon(?:es)?| mil)?|mil',
                 lambda match: str(name_to_num(match.group(0))), input_str)
print(num_str)

name_str = re.sub(r'\d+',
                  lambda match: num_to_name(match.group(0)), num_str)
print(name_str)

Output:

creo que hay 330000000000000 2000000000 18000000 320000 459 47475822 1000
creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47 millones mil

Note that the final result is not exactly the input string, since the input string had some numbers that could themselves be converted (like '47 millones'). Also, you indicated that 1 mil is written as mil, so an additional field was added to NUMBERS to flag that, and num_to_name() was adjusted to deal with that case.

The function num_to_name(n) takes an integer (or string, converted to an integer) and finds the appropriate way to write it as a number, using the naming defined in NUMBERS. If it doesn’t match any of the sizes, it just returns the number as a string.

The function name_to_num(s) takes a string and checks whether it ends in any of the names (with or without plural) defined in NUMBERS. If it does, it tries to convert the rest of the string into an integer and returns that value multiplied by the matching factor. Otherwise, it tries to just return the integer value of the string.

At the bottom, there are two regexes matching the relevant parts of the input string, each using a lambda to replace the found fragments by calling one of the two functions.
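
As a minimal sketch of that mechanism, independent of the number names: re.sub() accepts a function as its replacement argument and calls it once per match. The function name double is just an example.

```python
import re


def double(match: re.Match) -> str:
    # re.sub() calls this once per match and inserts the returned string.
    return str(int(match.group(0)) * 2)


text = "a 2 b 10 c"
print(re.sub(r"\d+", double, text))  # a 4 b 20 c
```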

(From your comment, I noted that you actually want subsequent matches decreasing in size to be combined into a single number. The above doesn't answer that, but I'll leave the code all the same.)

This additional code does that, together with the first part:

def full_name_to_num(s):
    subs = []
    last_f = 0

    def sub(s):
        s, end = (s[:-1], ' ') if s[-1] == ' ' else (s, '')
        nonlocal last_f
        n, f = name_to_num(s, True)
        if subs and (f < last_f):
            subs[-1] = subs[-1] + n
            result = ''
        else:
            subs.append(n)
            result = str(len(subs)-1) + end
        last_f = f
        return result

    temp = re.sub(r'(?:\d+(?: (?:quatr|tr|b|m)illon(?:es)?| mil)?|mil) ?', lambda match: sub(match.group(0)), s)
    return re.sub(r'\d+', lambda match: str(subs[int(match.group(0))]), temp)


def full_num_to_name(s):
    def sub(s):
        n = int(s)
        result = [str(n % NUMBERS[-1][0])] if n % NUMBERS[-1][0] else []
        for size, _, _, _ in reversed(NUMBERS):
            if (n // size) % 1000:
                result.append(num_to_name(n % (size * 1000)))
        return ' '.join(reversed(result))

    return re.sub(r'\d+', lambda match: sub(match.group(0)), s)


input_str = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"
full_num_str = full_name_to_num(input_str)
print(full_num_str)

full_name_str = full_num_to_name(full_num_str)
print(full_name_str)

Extra output:

creo que hay 330002018320459 47475822
creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47 millones 475 mil 822
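
The merging rule inside full_name_to_num() can be looked at in isolation. The sketch below is a simplified restatement, not the code above: it takes (value, factor) pairs of the kind name_to_num(s, True) returns and adds a value to the running total whenever its factor is smaller than the previous one. The name merge_decreasing is hypothetical.

```python
from typing import List, Tuple


def merge_decreasing(parts: List[Tuple[int, int]]) -> List[int]:
    """Sum consecutive (value, factor) pairs while the factor keeps decreasing."""
    totals: List[int] = []
    last_factor = 0
    for value, factor in parts:
        if totals and factor < last_factor:
            totals[-1] += value      # smaller unit: the same number continues
        else:
            totals.append(value)     # equal or larger unit: a new number starts
        last_factor = factor
    return totals


# (value, factor) pairs for the example input; a plain number has factor 0.
parts = [(330 * 10**12, 10**12), (2 * 10**9, 10**9), (18 * 10**6, 10**6),
         (320 * 10**3, 10**3), (459, 0), (47475822, 0)]
print(merge_decreasing(parts))  # [330002018320459, 47475822]
```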
Answered By: Grismar

I think you shouldn’t use pure regex for that, but rather combine it with some arithmetic parsing. This is an example of how to solve it. Note that it actually adds the numbers up in a way that makes sense and doesn’t just concatenate them, so the results are somewhat different from what you defined as desired.

import re

input_str1 = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"
input_str2 = "sumaria 6 cuatrillones 789 billones 320 mil a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"


def wrap_word(word: str) -> str:
    # Pattern: a number, whitespace, then the whole word
    return fr"(\d+)\s+\b{word}\b"


def wrap_num(num: int) -> str:
    # Replacement: the captured number, a literal "*", and the factor
    return fr"\1*{num}"


def eval_mult_exp(text: str) -> str:
    for op1, op2 in re.findall(r"(\d+)\*(\d+)", text):
        text = re.sub(pattern=op1 + r"\*" + op2, repl=str(int(op1) * int(op2)), string=text)
    return text


def eval_addition_exp(text: str) -> str:
    if not re.search(r"(\d+) (\d+)", text):  # recursion halting condition
        return text

    for op1, op2 in re.findall(r"(\d+) (\d+)", text):
        text = re.sub(pattern=op1 + " " + op2, repl=str(int(op1) + int(op2)), string=text)
    return eval_addition_exp(text)


def word_to_num(word: str) -> str:
    for pattern, numeric_replacement in [
        (wrap_word("mil"), wrap_num(10**3)),
        (wrap_word("millon(es)?"), wrap_num(10**6)),
        (wrap_word("billon(es)?"), wrap_num(10**9)),
        (wrap_word("trillon(es)?"), wrap_num(10**12)),
        (wrap_word("cuatrillon(es)?"), wrap_num(10**15)),
    ]:
        word = re.sub(pattern, numeric_replacement, word)
    return word


print(eval_addition_exp(eval_mult_exp(word_to_num(input_str2))))

Out[1]:

sumaria 6000789000320000 a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades
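
The result above uses factors of 1000 between the names. The same arithmetic idea can be sketched with the values listed in the question (millon = 10**6, billon = 10**12, trillon = 10**18, cuatrillon = 10**24), matching a whole run of "number word" pairs in one pass and summing it. This is a rough sketch under those assumptions: LONG_SCALE, run_value and replace_quantities are made-up names, a bare "mil" without a number in front is not handled, and the order of the units inside a run is not checked.

```python
import re

# Long-scale values as listed in the question (an assumption of this sketch).
LONG_SCALE = {"mil": 10**3, "millon": 10**6, "billon": 10**12,
              "trillon": 10**18, "cuatrillon": 10**24}

UNIT = r"(?:mil|(?:m|b|tr|cuatr)illon(?:es)?)\b"
# A run is one or more "<digits> <unit>" pairs, optionally ending in a bare
# number of at most three digits (the part below one thousand).
RUN = re.compile(rf"\d+\s+{UNIT}(?:\s+\d+\s+{UNIT})*(?:\s+\d{{1,3}}\b)?")
PAIR = re.compile(rf"(\d+)\s+({UNIT})")


def run_value(match: re.Match) -> str:
    text = match.group(0)
    total = 0
    for count, word in PAIR.findall(text):
        key = word[:-2] if word.endswith("es") else word  # drop the plural
        total += int(count) * LONG_SCALE[key]
    tail = PAIR.sub("", text).strip()  # what is left is the bare trailing number
    if tail:
        total += int(tail)
    return str(total)


def replace_quantities(text: str) -> str:
    # Words without a number in front ("varios millones") are left untouched.
    return RUN.sub(run_value, text)


print(replace_quantities("sumaria 6 cuatrillones 789 billones 320 mil a esta"))
# sumaria 6000000000789000000320000 a esta
```

Because only runs that start with digits are matched, "varios millones o trillones" passes through unchanged, and a long number such as 47475822 that follows a run is not absorbed into it.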

Excuse my Spanish 🙂