ValueError: invalid literal for int() with base 10: ''

Question:

I am trying to extract words from a text. I have this text:

"[' nna)nn nnFa.The Rotterdam District Court shall have exclusive jurisdiction.nnrut ard wegetablesnx0c']"

and I have this method:

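import re

def total_fruit_per_sort():
    # Each match is a (number, fruit) tuple; verdi47 is the text being searched.
    number_found = re.findall(total_amount_fruit_regex(), verdi47)
    print(number_found)
    fruit_dict = {}
    for n, f in number_found:
        # int(n) raises ValueError when the number group matched an empty string.
        fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
    return {value: key for key, value in fruit_dict.items()}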

def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*(" + '|'.join(format_(word)
                                             for word in fruit_words) + ')'

and the fruit_words:

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

and then the print call outputs this:

[('16', 'Watermeloenen'), ('360', 'Watermeloenen'), ('6', 'Watermeloenen'), ('75', 'Watermeloenen'), ('9', 'Watermeloenen'), ('688', 'Appels'), ('22', 'Sinaasappels'), ('80', 'Sinaasappels'), ('160', 'Sinaasappels'), ('320', 'Sinaasappels'), ('160', 'Sinaasappels'), ('61', 'Sinaasappels')]

So this is correct.

But then I have this text:

"['a= (>)nnFan nx0c']"

and it returns this:

[('566', 'Ananas'), ('706', 'Appels'), ('598', 'Peen Waspeen'), ('176', 'Sinaasappels'), ('179', 'Peen Waspeen'), ('222', 'Peen Waspeen'), ('270', 'Peen Waspeen'), ('400', 'Rettich'), ('129', 'Rettich'), ('48', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('160', 'Sinaasappels'), ('6', 'Sinaasappels'), ('320', 'Sinaasappels')]

So Rettich has a lot of empty values.

How can I improve this so that all the values are also extracted from the second text?

Asked By: mightycode Newton


Answers:

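You need to change the regexp to allow an optional = or ~= between the number and the fruit name. When such a separator sits between the digits and the fruit, the original pattern can only match the fruit with an empty number group, and int('') then raises the ValueError.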

def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*(?:=|~=)?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'
Answered By: Barmar
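
For illustration, here is a minimal, self-contained sketch that combines the pattern above with a guard against empty number matches. The sample text is hypothetical (it is not the text from the question). The sketch also assumes that totals keyed by fruit name are what is wanted, so it does not invert the dictionary the way the question's function does, and it takes the text as a parameter instead of reading verdi47.

```python
import re

# Fruit names from the question; 'Peen Waspeen' is listed before 'Peen',
# so the alternation tries the longer name first.
fruit_words = ['Appels', 'Ananas', 'Peen Waspeen', 'Tomaten Cherry',
               'Sinaasappels', 'Watermeloenen', 'Rettich', 'Peren', 'Peen',
               'Mandarijnen', 'Meloenen', 'Grapefruit']


def total_amount_fruit_regex(format_=re.escape):
    # Number, optional '=' or '~=' separator, then one of the fruit names.
    return r"(\d*(?:\.\d+)*)\s*(?:=|~=)?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'


def total_fruit_per_sort(text):
    totals = {}
    for n, f in re.findall(total_amount_fruit_regex(), text):
        if not n:
            # No digits in front of the fruit name: skip instead of int('').
            continue
        totals[f] = totals.get(f, 0) + int(n)
    return totals


# Hypothetical sample text, not taken from the question.
sample = ("16 Watermeloenen\n400 = Rettich\n129 ~= Rettich\n"
          "Rettich\n22 Sinaasappels")
print(total_fruit_per_sort(sample))
# {'Watermeloenen': 16, 'Rettich': 529, 'Sinaasappels': 22}
```

The bare "Rettich" line still produces a match with an empty number, which the guard skips. Note that int() would also fail on a decimal such as '1.5', which the pattern permits, so a text with decimal amounts would need float() or similar handling.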