How to filter items?
Question:
I have a method that extracts some values from a text string. At the moment the results are ordered by how often each word in the list occurs. I want them ordered by first occurrence in the text string.
So this is the text:
text = """['
E-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.
rut ard wegetables
']"""
and this is the filter method:
import re

def total_fruit_cost(file_name):
    fruit_cost_found = []
    for fruit in fruit_words:
        # all amounts found for this fruit, in the order they appear in the text
        m = re.findall(regex_fruit_cost(fruit), file_name)
        if m:
            fruit_cost_found.append(m)
    # flatten the per-fruit lists into a single list
    return [item for sublist in fruit_cost_found for item in sublist]
and the regex_fruit_cost:
def regex_fruit_cost(subst):
    return r"(?<=" + subst + r").*?(?P<number>[0-9,.]*)\n"
and the list of fruit_words:
fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
'Tomaten Cherry', 'Sinaasappels',
'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']
So the output is now like this:
['3.488,16', '137,50', '500,00', '1.000,00', '2.000,00', '1.000,00', '381,25', '123,20', '2.772,00', '46,20', '577,50', '69,30']
But it has to be ordered by first occurrence: 123,20, 2.772,00, 46,20, and so on up to 381,25, because that is the order in which the values occur in the text string.
So the ordering should be first occurrence, second occurrence, and so on in the text string.
My question is: what do I have to change?
So if you take this text string:
verdi49 = "['Ver\n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr; ¢ 71215 Koopliedenweg 38\nDeb. nr, : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 13-12-21\nAantal Omschrijving Prijs Bedrag\nOrdernumber’ : 77150 Loading date 02-12-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date\nWK49\nD.C. Schoolfruit\n612 Peen Breek peen 10x1kg B Rabbit NLI € 4,/0 € 2.876,40\n688 Appels Royal Gala 13kg 60/65 Generica PL I € 4,87 € 3.350,56\n320 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 2.000,00\n400 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 2.500,00\n74 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 462,50\nMidden Zuid Noord\nVerDi Wortel 202 164 246 612\nVerDi Sinaas 262 212 320 794\n\n \n\nTotaal Colli\n\n \n\nGAT — 7\n\noe TUNUMMER 4 |\n\n \n\nTotaal Bedrag\n\n€ 12.196,51\n\n \n\nBetaling binnen 30 dagen\n\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\n\nVerDi Import BV\n\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands\nTel. +31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25\n\nE-mail: [email protected], www.verdiimport.nl\n\nING Bank N.V, Rotterdam IBAN number: NL17INGBO006959173\nSWIFT/BIC: INGBNL2A, VAT number: NL851703884B01\nChamber of Commerce Rotterdam no. 55424309\n\nOutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\nfrt ard vegetables\n\n \n\x0c']"
it also returns this number: 12.196,51. This number should not be included.
So the regex has to be combined with the list of fruit_words.
I tried it like this:
def regex_fruit_cost():
    return r"(?<=" + '|'.join(re.escape(word) for word in fruit_words) + ')' + r").*?(?P<number>[0-9,.]*)\n"
But then I get this error:
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 126
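The error above can be reproduced in isolation. The sketch below is only an illustration of what the `re` module reports, not the asker's final code: the concatenation closes the look-behind twice, and even with the extra `)` removed, Python's `re` rejects a look-behind whose alternatives have different lengths. The variable names `broken`, `lookbehind` and `working` are made up for this example.

```python
import re

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen',
               'Mandarijnen', 'Meloenen', 'Grapefruit']

alternation = '|'.join(re.escape(word) for word in fruit_words)

# 1) The pattern from the question: ')' is added twice, so one is unmatched.
broken = r"(?<=" + alternation + ')' + r").*?(?P<number>[0-9,.]*)\n"
broken_error = None
try:
    re.compile(broken)
except re.error as exc:
    broken_error = str(exc)

# 2) Balanced, but still rejected: the fruit names differ in length and
#    Python's re only accepts fixed-width look-behind patterns.
lookbehind = r"(?<=" + alternation + r").*?(?P<number>[0-9,.]*)\n"
lookbehind_error = None
try:
    re.compile(lookbehind)
except re.error as exc:
    lookbehind_error = str(exc)

# 3) An ordinary capturing group has no width restriction.
working = re.compile(r"(" + alternation + r").*?(?P<number>[0-9,.]*)\n")

print(broken_error)
print(lookbehind_error)
print(working.findall("688 Appels Royal Gala 13kg 60/65 Generica PL I € 4,87 € 3.350,56\n"))
```

Replacing the look-behind with a capturing group also makes `re.findall` return the matched fruit name next to the amount.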
Thank you. But I also tested it with this string:
verdi9_1 = "['a>)\n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr. : 74658 Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 24-02-22\nAantal Omschrijving Prijs Bedrag\nOrder number : 81305 Loading date : 24-02-22 Incoterm: : FRA\nYour ref. : SCHOOLFRUIT Delivery date\nWwkKO9\nD.C. Schoolfruit\n262 Peren Conference 12kg 55/60 GENER NL II € 5,28 € 1.383,36\n120 Grapefruit Rio Red 14kg 35-OT Tekasya TR I € 10,50 € 1.260,00\n28 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 294,00\n39 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 409,50\n55 Grapefruit Rio Red 14kg 36-OT Tuval TRI € 10,50 € 577,50\n287 Appels Royal Gala 13kg 60/65 Generica PL I € 5,72 € 1.641,64\nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n791 € 5.566,00 € 6.066,94\n\n \n\nBetaling binnen 30 dagen\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\n\nING Bank N.V. Rotterdam IBAN number: NL17INGB0006959173\nSWIFT/BIC: INGBNL2A, VAT number: NL851703884B01\n\na\nChamber of Commerce Rotterdam no. 55424309, VerDi\nDutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\n\nfruit and vegetables\n\x0c']"
and then the output is:
[('Peren', ''), ('Grapefruit', ''), ('Grapefruit', ''), ('Appels', '')]
Answers:
import re

def regex_fruit_cost(subst):
    return rf"(?:{subst}).*?(?P<number>[0-9,.]*)\n"

# wrap every fruit name in its own non-capturing group
fruits_groups = (f"(?:{fruit})" for fruit in fruit_words)
# join the names with | inside one capturing group, so findall also returns the fruit
fruits_combined_with_capture = f'({"|".join(fruits_groups)})'
fruits_pattern = regex_fruit_cost(fruits_combined_with_capture)
print(re.findall(fruits_pattern, text))
You were on the right track, but instead of looping over the fruits and searching for each one in turn, you can put them all into a single regex pattern using the | character. Try printing the variables I used to see what strings they produce.
Edit: it also works for your verdi49 text.
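To make the difference in ordering concrete, here is a minimal sketch that runs both approaches on three invoice lines taken from the question's verdi9_1 string, assuming the lines are separated by real newline characters. The helper names `per_fruit` and `combined` are hypothetical, and the shortened `fruit_words` list is only for the example. Sorting the names longest first is an extra precaution so that 'Peen Waspeen' is tried before 'Peen'.

```python
import re

fruit_words = ['Appels', 'Peen Waspeen', 'Sinaasappels', 'Peren', 'Peen', 'Grapefruit']

sample = (
    "262 Peren Conference 12kg 55/60 GENER NL II € 5,28 € 1.383,36\n"
    "120 Grapefruit Rio Red 14kg 35-OT Tekasya TR I € 10,50 € 1.260,00\n"
    "287 Appels Royal Gala 13kg 60/65 Generica PL I € 5,72 € 1.641,64\n"
)

def per_fruit(text):
    """One search per fruit: results follow the order of fruit_words."""
    found = []
    for fruit in fruit_words:
        found.extend(re.findall(rf"{re.escape(fruit)}.*?([0-9,.]*)\n", text))
    return found

def combined(text):
    """One search with all fruits: results follow the order of the text."""
    names = sorted(fruit_words, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in names)
    pattern = rf"({alternation}).*?(?P<number>[0-9,.]*)\n"
    return re.findall(pattern, text)

print(per_fruit(sample))
# ['1.641,64', '1.383,36', '1.260,00']
print(combined(sample))
# [('Peren', '1.383,36'), ('Grapefruit', '1.260,00'), ('Appels', '1.641,64')]
```

Because `re.findall` scans the text once from left to right, the combined pattern reports matches in the order they appear, regardless of how the names are ordered in the list.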
Outputs:
[('Watermeloenen', '123,20'), ('Watermeloenen', '2.772,00'), ('Watermeloenen', '46,20'), ('Watermeloenen', '577,50'), ('Watermeloenen', '69,30'), ('Appels', '3.488,16'), ('Sinaasappels', '137,50'), ('Sinaasappels', '500,00'), ('Sinaasappels', '1.000,00'), ('Sinaasappels', '2.000,00'), ('Sinaasappels', '1.000,00'), ('Sinaasappels', '381,25')]
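For the follow-up cases in the question (an invoice total such as 12.196,51 being picked up, or an empty string being captured), one possible variation is to anchor the pattern to whole lines and require a properly formed amount. This is a sketch under assumptions, not a tested solution for every invoice: it assumes real newlines between lines and amounts written like 1.383,36. The name `fruit_costs_in_order` and the conversion to `Decimal` are additions for illustration.

```python
import re
from decimal import Decimal

fruit_words = ['Appels', 'Peen Waspeen', 'Sinaasappels', 'Peren', 'Peen', 'Grapefruit']

# Longer names first, so 'Peen Waspeen' is tried before 'Peen'.
alternation = '|'.join(
    re.escape(word) for word in sorted(fruit_words, key=len, reverse=True)
)

# A line only matches if it contains a fruit name AND ends in an amount
# with at least one digit, a decimal comma and two decimals.
line_pattern = re.compile(
    rf"^.*?\b(?P<fruit>{alternation})\b.*?(?P<number>\d[\d.]*,\d{{2}})[ \t]*$",
    re.MULTILINE,
)

def fruit_costs_in_order(text):
    """Return (fruit, amount) pairs in the order the lines appear in text."""
    results = []
    for match in line_pattern.finditer(text):
        # '1.383,36' -> Decimal('1383.36')
        amount = Decimal(match['number'].replace('.', '').replace(',', '.'))
        results.append((match['fruit'], amount))
    return results

sample = (
    "262 Peren Conference 12kg 55/60 GENER NL II € 5,28 € 1.383,36\n"
    "120 Grapefruit Rio Red 14kg 35-OT Tekasya TR I € 10,50 € 1.260,00\n"
    "287 Appels Royal Gala 13kg 60/65 Generica PL I € 5,72 € 1.641,64\n"
    "Totaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n"
    "791 € 5.566,00 € 6.066,94\n"
)

print(fruit_costs_in_order(sample))
```

The two total lines at the end of the sample contain no fruit name, so they produce no match, and the amount group cannot be empty because it requires digits.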