How to filter items?

Question:

I have a method that extracts some values from a text string. At the moment the results are grouped by word, in the order of the word list: all matches for the first word, then all matches for the second word, and so on. I want them ordered by where they occur in the text string instead.

So this is the text:

text = """['

E-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.

rut ard wegetables

']"""

and this is the filter method:

import re

def total_fruit_cost(file_name):
    # Despite its name, file_name holds the text to search.
    fruit_cost_found = []
    for fruit in fruit_words:
        # All amounts for this one fruit, so results end up grouped per fruit.
        m = re.findall(regex_fruit_cost(fruit), file_name)
        if m:
            fruit_cost_found.append(m)
    # Flatten the per-fruit lists into a single list.
    return [item for sublist in fruit_cost_found for item in sublist]

and the regex_fruit_cost:

def regex_fruit_cost(subst):
    return r"(?<=" + subst + r").*?(?P<number>[0-9,.]*)\n"

and the list of fruit_words:

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

So the output is now like this:

['3.488,16', '137,50', '500,00', '1.000,00', '2.000,00', '1.000,00', '381,25', '123,20', '2.772,00', '46,20', '577,50', '69,30']

But it has to be ordered by first occurrence: 123,20, 2.772,00, 46,20, and so on, ending with 381,25, because that is the order in which the values occur in the text string.

In other words, the ordering should be first occurrence, second occurrence, etc. in the text string.

My question is: what do I have to change?

So if you take this text string:

verdi49 = "['VernnFactuurnVerdi Import SchoolfruitnFactuur nr; ¢ 71215 Koopliedenweg 38nDeb. nr, : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum : 13-12-21nAantal Omschrijving Prijs BedragnOrdernumber’ : 77150 Loading date 02-12-21 Incoterm: : FOTnYour ref. : SCHOOLFRUIT Delivery datenWK49nD.C. Schoolfruitn612 Peen Breek peen 10x1kg B Rabbit NLI € 4,/0 € 2.876,40n688 Appels Royal Gala 13kg 60/65 Generica PL I € 4,87 € 3.350,56n320 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 2.000,00n400 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 2.500,00n74 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 462,50nMidden Zuid NoordnVerDi Wortel 202 164 246 612nVerDi Sinaas 262 212 320 794nn nnTotaal Collinn nnGAT — 7nnoe TUNUMMER 4 |nn   nnTotaal Bedragnn€ 12.196,51nn nnBetaling binnen 30 dagennnAchterstand wordt gemeld bij de kredietverzekeringsmaatschappijnnVerDi Import BVnnKoopliedenweg 38, 2991 LN Barendrecht, The NetherlandsnTel. +31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25nnE-mail: [email protected], www.verdiimport.nlnnING Bank N.V, Rotterdam IBAN number: NL17INGBO006959173nSWIFT/BIC: INGBNL2A, VAT number: NL851703884B01nChamber of Commerce Rotterdam no. 55424309nnOutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.nfrt ard vegetablesnn nx0c']"

the result also includes this number: 12.196,51. This number should not be included.

So the regex has to be combined with the list of fruit_words.

I tried it like this:

def regex_fruit_cost():
    return r"(?<=" + '|'.join(re.escape(word) for word in fruit_words) + ')' + r").*?(?P<number>[0-9,.]*)\n"

But then I get this error:

 raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 126

Thank you. But I also tested it with this string:

verdi9_1 = "['a>)nnFactuurnVerdi Import SchoolfruitnFactuur nr. : 74658 Koopliedenweg 38nDeb. nr. : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum : 24-02-22nAantal Omschrijving Prijs BedragnOrder number : 81305 Loading date : 24-02-22 Incoterm: : FRAnYour ref. : SCHOOLFRUIT Delivery datenWwkKO9nD.C. Schoolfruitn262 Peren Conference 12kg 55/60 GENER NL II € 5,28 € 1.383,36n120 Grapefruit Rio Red 14kg 35-OT Tekasya TR I € 10,50 € 1.260,00n28 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 294,00n39 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 409,50n55 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 577,50n287 Appels Royal Gala 13kg 60/65 Generica PL I € 5,72 € 1.641,64nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedragn791 € 5.566,00 € 6.066,94nn nnBetaling binnen 30 dagennAchterstand wordt gemeld bij de kredietverzekeringsmaatschappijnnING Bank N.V. Rotterdam IBAN number: NL17INGB0006959173nSWIFT/BIC: INGBNL2A, VAT number: NL851703884B01nnanChamber of Commerce Rotterdam no. 55424309, VerDinDutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.nnfruit and vegetablesnx0c']"

and then the output is:

[('Peren', ''), ('Grapefruit', ''), ('Grapefruit', ''), ('Appels', '')]
Asked By: mightycode Newton

||

Answers:

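import re

# Uses fruit_words and text as defined in the question.

def regex_fruit_cost(subst):
    # A fruit word, then lazily the rest of the line, then the amount just before the newline.
    return rf"(?:{subst}).*?(?P<number>[0-9,.]*)\n"

# Wrap each word in a non-capturing group, then join them into one capturing alternation.
fruits_groups = (f"(?:{fruit})" for fruit in fruit_words)
fruits_combined_with_capture = f'({"|".join(fruits_groups)})'
fruits_pattern = regex_fruit_cost(fruits_combined_with_capture)
print(re.findall(fruits_pattern, text))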

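You were on the right track, but instead of looping over the fruits and searching for each one in turn, you can put all of them into a single regex pattern, separated by the | character. re.findall scans the text from left to right, so the matches come back in the order in which they occur in the text. Try printing the variables I used to see what strings they produce.

The "unbalanced parenthesis" error in your attempt comes from the extra ')' that is concatenated right after the join: the string that follows already starts with ')'. Note also that Python's re module only accepts fixed-width look-behind patterns, so an alternation of words with different lengths cannot go inside (?<=...). A plain group, as used here, avoids that restriction.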
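To make the intermediate strings and the resulting order easier to see, here is a minimal sketch on a much smaller input. The two-word list and the three-line sample text are made up for illustration (they are not the invoice from the question), the text is assumed to contain real newline characters, and re.escape is used on the words as in the question's own attempt.

```python
import re

# Hypothetical, shortened inputs for illustration only.
sample_words = ["Appels", "Peren"]
sample_text = (
    "10 Peren Conference € 5,28 € 52,80\n"
    "20 Appels Gala € 4,87 € 97,40\n"
    "5 Peren Conference € 5,28 € 26,40\n"
)

# One capturing alternation built from the word list.
words_pattern = "(" + "|".join(re.escape(w) for w in sample_words) + ")"
# Word, then lazily the rest of the line, then the amount right before the newline.
full_pattern = words_pattern + r".*?(?P<number>[0-9,.]*)\n"

print(words_pattern)  # -> (Appels|Peren)
print(full_pattern)

# findall scans left to right, so matches follow the order of the text,
# not the order of sample_words.
matches = re.findall(full_pattern, sample_text)
print(matches)
# -> [('Peren', '52,80'), ('Appels', '97,40'), ('Peren', '26,40')]
```

Although "Appels" comes first in the list, the first match is the "Peren" line, because that line comes first in the text.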

Edit: it also works for your verdi49 text
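The question also asks that the invoice total (12.196,51) stay out of the result. The sketch below suggests why a combined pattern of this shape skips it: a match has to start at a fruit word and end at the newline of that same line (the dot does not match a newline by default), and the total sits on a line without any fruit word. The excerpt is a shortened fragment modelled on verdi49; treating the line breaks as real newline characters is an assumption.

```python
import re

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

# Shortened excerpt modelled on verdi49; real newlines are an assumption.
excerpt = (
    "688 Appels Royal Gala 13kg 60/65 Generica PL I € 4,87 € 3.350,56\n"
    "320 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 2.000,00\n"
    "Totaal Bedrag\n"
    "\n"
    "€ 12.196,51\n"
    "\n"
    "Betaling binnen 30 dagen\n"
)

fruit_alternation = "|".join(re.escape(word) for word in fruit_words)
pattern = rf"({fruit_alternation}).*?(?P<number>[0-9,.]*)\n"

found = re.findall(pattern, excerpt)
print(found)
# -> [('Appels', '3.350,56'), ('Sinaasappels', '2.000,00')]

# The total is not among the captured amounts.
amounts = [number for _, number in found]
print("12.196,51" in amounts)  # -> False
```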

Output for the first text string (text):

[('Watermeloenen', '123,20'), ('Watermeloenen', '2.772,00'), ('Watermeloenen', '46,20'), ('Watermeloenen', '577,50'), ('Watermeloenen', '69,30'), ('Appels', '3.488,16'), ('Sinaasappels', '137,50'), ('Sinaasappels', '500,00'), ('Sinaasappels', '1.000,00'), ('Sinaasappels', '2.000,00'), ('Sinaasappels', '1.000,00'), ('Sinaasappels', '381,25')]
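If only the amounts are wanted, as in the list shown in the question, one option is to read the named group number from each match instead of collecting tuples. This is a minimal sketch, not a drop-in replacement: the helper name amounts_in_text_order and the sample text are hypothetical, and the text is again assumed to contain real newlines.

```python
import re

def amounts_in_text_order(text, words):
    """Return the trailing amount (possibly empty) of each matched line, in text order."""
    alternation = "|".join(re.escape(word) for word in words)
    # Non-capturing group for the words, so only the amount is captured.
    pattern = re.compile(rf"(?:{alternation}).*?(?P<number>[0-9,.]*)\n")
    return [m.group("number") for m in pattern.finditer(text)]

# Hypothetical two-line sample.
sample = "3 Peren € 5,28 € 15,84\n2 Appels € 4,87 € 9,74\n"
result = amounts_in_text_order(sample, ["Appels", "Peren"])
print(result)  # -> ['15,84', '9,74']
```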
Answered By: Gábor Fekete