ValueError: invalid literal for int() with base 10: ''

Question

I am trying to extract words from a text. I have this text:

"[' \n\na)\n\n \n\nFa.The Rotterdam District Court shall have exclusive jurisdiction.\n\nrut ard wegetables\n\x0c']"

and I have this method:

import re


def total_fruit_per_sort():
    number_found = re.findall(total_amount_fruit_regex(), verdi47)
    print(number_found)
    fruit_dict = {}
    for n, f in number_found:
        fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
    return {value: key for key, value in fruit_dict.items()}


def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*(" + '|'.join(format_(word)
                                             for word in fruit_words) + ')'

and the fruit_words:

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

and then the print returns this:

[('16', 'Watermeloenen'), ('360', 'Watermeloenen'), ('6', 'Watermeloenen'), ('75', 'Watermeloenen'), ('9', 'Watermeloenen'), ('688', 'Appels'), ('22', 'Sinaasappels'), ('80', 'Sinaasappels'), ('160', 'Sinaasappels'), ('320', 'Sinaasappels'), ('160', 'Sinaasappels'), ('61', 'Sinaasappels')]

So this is correct.

But then I have this text:

"['a= (>)\n\nFa\n \n\x0c']"

and it returns this:

[('566', 'Ananas'), ('706', 'Appels'), ('598', 'Peen Waspeen'), ('176', 'Sinaasappels'), ('179', 'Peen Waspeen'), ('222', 'Peen Waspeen'), ('270', 'Peen Waspeen'), ('400', 'Rettich'), ('129', 'Rettich'), ('48', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('160', 'Sinaasappels'), ('6', 'Sinaasappels'), ('320', 'Sinaasappels')]

So Rettich has several empty values, and calling int() on an empty string raises the ValueError in the title.
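
The empty strings can be reproduced in isolation. This is a minimal sketch, assuming the number and the fruit name in the second text are separated by something the pattern does not allow. The `sample` line and the shortened `fruit_words` list are hypothetical, not taken from the real input. Because `\d*` may match zero digits, the regex can still match the fruit name on its own, leaving the first group empty.

```python
import re

# Shortened word list, for illustration only.
fruit_words = ['Rettich', 'Appels']

pattern = r"(\d*(?:\.\d+)*)\s*(" + '|'.join(re.escape(w) for w in fruit_words) + ')'

# Hypothetical input: the second number is followed by "=" before the fruit name.
sample = "400 Rettich\n129 = Rettich"

matches = re.findall(pattern, sample)
print(matches)

# The second match has an empty number, so int() raises the ValueError.
try:
    int(matches[1][0])
except ValueError as exc:
    print(exc)
```

In this sketch the "=" sits between "129" and "Rettich", so the number is dropped and only the fruit name is matched.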

Question: how can I improve this so that all the values are also extracted from the second text?

Asked By: mightycode Newton

Answer 1

You need to change the regexp to allow an optional = or ~= between the number and the fruit name.

def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*(?:=|~=)?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'

Answered By: Barmar
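
A minimal sketch of how the adjusted pattern might be used; it is not part of the answer itself. The `sample` text, the shortened `fruit_words` list, and the `text` parameter are assumptions for illustration. If the real input uses other separators, the optional group would need extending. The sketch also skips matches with an empty number, so `int()` is never called on `''`, and it returns the totals keyed by fruit name rather than the inverted mapping used in the question.

```python
import re

# Shortened word list, for illustration only.
fruit_words = ['Peen Waspeen', 'Rettich', 'Peen']


def total_amount_fruit_regex(format_=re.escape):
    # An optional "=" or "~=" is allowed between the number and the fruit name.
    return r"(\d*(?:\.\d+)*)\s*(?:=|~=)?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'


def total_fruit_per_sort(text):
    # `text` is a hypothetical parameter replacing the global input string.
    fruit_dict = {}
    for n, f in re.findall(total_amount_fruit_regex(), text):
        if not n:
            # No number was captured; skip instead of calling int('').
            continue
        fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
    return fruit_dict


# Hypothetical input mixing the separator styles, plus one fruit with no number.
sample = "400 Rettich\n129 = Rettich\n48 ~= Rettich\n598 Peen Waspeen\nRettich"
totals = total_fruit_per_sort(sample)
print(totals)
```

The guard only hides fruit names that have no number in front of them. If such matches are unexpected, it may be worth logging them to see which separators are still missing from the pattern.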
