How to extract the number before specific text with a regular expression?

Question:

I am trying to extract the number that appears before some specific text. I have this long string:

verdi = "['a= (>)nnFactuurnVerdi Import SchoolfruitnFactuur nr; %: 70273 Koopliedenweg 38nDeb. nr. : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum : 19-11-21nAantal Omschrijving Prijs BedragnOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOTnYour ref. : SCHOOLFRUIT Delivery date :nWK46nVerdi Import Schoolfruitn566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOTnYour ref. : SCHOOLFRUIT Delivery date :nWK47nD.C, Schoolfruitn176 Sinaasappels Valencias 15kg 125 Generica UY Klasse I € 6,25 € 1.100,00n179 Peen Waspeen 14x1kg 200-400 Generica BE Klasse I € 6,30 € 1.127,70n222 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 € 1.398,60n270 Peen Waspeen 14x1ikg 200-400 Generica BE Klasse I € 6,30 € 1.701,00nZuidn176 sinaasn222 wortelnmiddenn270 wortelnNoordn179 wortelnOrder number : 75674 Loading date : 18-11-21 Incoterm: : FRAnYour ref. : SCHOOLFRUIT Delivery date :nWK47nD.C. Schoolfruitn400 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 1.880,00n129 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 606,30n48 Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 225,60n104 = Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 488,80n22 =Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 103,40n107 ~=Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 502,90n160 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 1.200,00n6 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 45,00n320 Sinaasappels Valencias 15kg 125 FVC ZA Klasse I € 7,50 € 2.400,00nREGIOnSINAASnMIDDEN: 219nNOORD: 267nVerDi Import BV ING Bank N.V. Rotterdam IBAN number: NL17INGB0006959173 aoethenKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01nnanTel. 
+31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDinE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.nnfrult and wegetadlesnn nx0c', 'a> >)nnFactuurnVerdi Import SchoolfruitnFactuur nr. : 70273 Koopliedenweg 38nDeb. nr. : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum ; 19-11-21nAantal Omschrijving Prijs BedragnRETTICH:nZUID: 216nNOORD: 328nMIDDEN: 266nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedragnn     n nn€ 23.812,78 € 25.955,93nn   nnBetaling binnen 30 dagennAchterstand wordt gemeld bij de kredietverzekeringsmaatschappijnnVerDi Import BV ING Bank N.V. Rotterdam IBAN number: NL17INGBO006959173 =nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 7nTel. +31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDnE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction. lnnfrutt and vegetables:nn nx0c']"

and I want to extract the number before each of these phrases:


fruit_words = ['Appels Royal Gala 13kg 60/65 Generica PL Klasse I',
               'Ananas Crownless 14kg 10 Sweet CR Klasse I', 
               'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I' ]

So I tried it like this:

number_foud = re.findall(r"([0-9]+)" .join(fruit_words),verdi)

But when I print the result with

print(number_foud)

it returns []

Question: what do I have to change so that it returns the number before the text?

Thank you

For example, from "222 Appels Royal Gala 13kg 60/65 Generica PL Klasse I" I want 222.

Asked By: mightycode Newton

||

Answers:

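You forgot the | in the regular expression. Using "([0-9]+)" as the separator in join places the number pattern between the phrases instead of in front of each one, so the pattern only matches when all three phrases occur back to back.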

import re

verdi = "['a= (>)nnFactuurnVerdi Import SchoolfruitnFactuur nr; %: 70273 Koopliedenweg 38nDeb. nr. : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum : 19-11-21nAantal Omschrijving Prijs BedragnOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOTnYour ref. : SCHOOLFRUIT Delivery date :nWK46nVerdi Import Schoolfruitn566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOTnYour ref. : SCHOOLFRUIT Delivery date :nWK47nD.C, Schoolfruitn176 Sinaasappels Valencias 15kg 125 Generica UY Klasse I € 6,25 € 1.100,00n179 Peen Waspeen 14x1kg 200-400 Generica BE Klasse I € 6,30 € 1.127,70n222 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 € 1.398,60n270 Peen Waspeen 14x1ikg 200-400 Generica BE Klasse I € 6,30 € 1.701,00nZuidn176 sinaasn222 wortelnmiddenn270 wortelnNoordn179 wortelnOrder number : 75674 Loading date : 18-11-21 Incoterm: : FRAnYour ref. : SCHOOLFRUIT Delivery date :nWK47nD.C. Schoolfruitn400 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 1.880,00n129 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 606,30n48 Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 225,60n104 = Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 488,80n22 =Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 103,40n107 ~=Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 502,90n160 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 1.200,00n6 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 45,00n320 Sinaasappels Valencias 15kg 125 FVC ZA Klasse I € 7,50 € 2.400,00nREGIOnSINAASnMIDDEN: 219nNOORD: 267nVerDi Import BV ING Bank N.V. Rotterdam IBAN number: NL17INGB0006959173 aoethenKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01nnanTel. 
+31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDinE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.nnfrult and wegetadlesnn nx0c', 'a> >)nnFactuurnVerdi Import SchoolfruitnFactuur nr. : 70273 Koopliedenweg 38nDeb. nr. : 108636 2991 LN BARENDRECHTnYour VAT nr. : NL851703884B01 NederlandnFactuur datum ; 19-11-21nAantal Omschrijving Prijs BedragnRETTICH:nZUID: 216nNOORD: 328nMIDDEN: 266nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedragnn     n nn€ 23.812,78 € 25.955,93nn   nnBetaling binnen 30 dagennAchterstand wordt gemeld bij de kredietverzekeringsmaatschappijnnVerDi Import BV ING Bank N.V. Rotterdam IBAN number: NL17INGBO006959173 =nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 7nTel. +31 (0)1 80 61 88 11, Fax +31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDnE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction. lnnfrutt and vegetables:nn nx0c']"

fruit_words = ['Appels Royal Gala 13kg 60/65 Generica PL Klasse I',
               'Ananas Crownless 14kg 10 Sweet CR Klasse I',
               'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I']

regex = r"([0-9]+)\s*(" + '|'.join(fruit_words) + ')'
print(regex)

numbers_found = re.findall(regex, verdi)
print(numbers_found)

The regex is

([0-9]+)\s*(Appels Royal Gala 13kg 60/65 Generica PL Klasse I|Ananas Crownless 14kg 10 Sweet CR Klasse I|Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I)

and the result is

[('566', 'Ananas Crownless 14kg 10 Sweet CR Klasse I'), 
 ('706', 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'), 
 ('598', 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I'), 
 ('222', 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I')]
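To see why the original attempt returned an empty list, it can help to print the pattern that join builds. The following is a minimal sketch; the shortened phrases and the sample text are stand-ins made up for illustration, not the real invoice data.

```python
import re

# Shortened, hypothetical stand-ins for the real phrases and invoice text
words = ["Appels Royal Gala", "Ananas Crownless"]
text = "566 Ananas Crownless 14kg 706 Appels Royal Gala 13kg"

# Original attempt: the number pattern ends up *between* the phrases
joined = r"([0-9]+)".join(words)
print(joined)                    # Appels Royal Gala([0-9]+)Ananas Crownless
print(re.findall(joined, text))  # []

# Alternation: a number, optional whitespace, then any one of the phrases
alternation = r"([0-9]+)\s*(" + "|".join(words) + ")"
print(re.findall(alternation, text))
# [('566', 'Ananas Crownless'), ('706', 'Appels Royal Gala')]
```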

If fruit_words could contain characters with a special meaning for the regular expression, you should escape the words:

regex = r"([0-9]+)\s*(" + '|'.join(re.escape(word) for word in fruit_words) + ')'
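A small sketch of what can go wrong without escaping. The phrase below is hypothetical and was chosen only because it contains the metacharacters ( ) and +; it does not come from the invoice.

```python
import re

# Hypothetical phrase containing regex metacharacters: ( ) and +
words = ["Kiwi (Green) 10kg+"]
text = "42 Kiwi (Green) 10kg+ Klasse I"

# Without escaping, "(Green)" is read as a group and "g+" as a repetition
unescaped = r"([0-9]+)\s*(" + "|".join(words) + ")"
# With re.escape, every character of the phrase is matched literally
escaped = r"([0-9]+)\s*(" + "|".join(re.escape(w) for w in words) + ")"

print(re.findall(unescaped, text))  # []
print(re.findall(escaped, text))    # [('42', 'Kiwi (Green) 10kg+')]
```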

And if you’re not interested in the text belonging to that number, you can make the group non-capturing with ?:, so that findall returns only the numbers.

regex = r"([0-9]+)\s*(?:" + '|'.join(re.escape(word) for word in fruit_words) + ')'
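As a rough illustration of the difference, here is how the return value of findall changes between the two variants. The shortened phrases and sample text are made up for the example.

```python
import re

# Shortened, hypothetical stand-ins for the real phrases and invoice text
words = ["Ananas Crownless", "Appels Royal Gala"]
text = "566 Ananas Crownless 14kg 706 Appels Royal Gala 13kg"
alternatives = "|".join(re.escape(w) for w in words)

# Two capturing groups: findall returns a list of (number, phrase) tuples
capturing = r"([0-9]+)\s*(" + alternatives + ")"
pairs = re.findall(capturing, text)
print(pairs)  # [('566', 'Ananas Crownless'), ('706', 'Appels Royal Gala')]

# One capturing group: findall returns a plain list of strings
non_capturing = r"([0-9]+)\s*(?:" + alternatives + ")"
numbers = re.findall(non_capturing, text)
print(numbers)  # ['566', '706']

# The matches are strings; convert them if integers are needed
quantities = [int(n) for n in numbers]
print(quantities)  # [566, 706]
```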

As noted in the comments the regular expression doesn’t find some values. The reason is simple: the texts don’t match. We’re looking for "14x1lkg", but the texts are "14x1ikg" and "14x1kg".

If we change 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I' in the fruit list to 'Peen Waspeen 14x1.?kg 200-400 Generica BE Klasse I' and construct the regex with r"([0-9]+)\s*(" + '|'.join(fruit_words) + ')' the result is

[('566', 'Ananas Crownless 14kg 10 Sweet CR Klasse I'),
 ('706', 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'),
 ('598', 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I'),
 ('179', 'Peen Waspeen 14x1kg 200-400 Generica BE Klasse I'),
 ('222', 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I'),
 ('270', 'Peen Waspeen 14x1ikg 200-400 Generica BE Klasse I')]

Caveat: since the text now contains the pattern .?, we can no longer pass it through re.escape, which would turn .? into literal characters.
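One possible way around this caveat, offered only as a sketch, is to escape the literal parts of a phrase separately and put the unescaped .? between them. Splitting the phrase into a prefix and a suffix is a convention made up for this example, and the sample text is a shortened stand-in for the invoice.

```python
import re

# Literal parts of the phrase, split where the optional character may appear
prefix = "Peen Waspeen 14x1"
suffix = "kg 200-400 Generica BE Klasse I"

# Escape only the literal parts; ".?" stays a real pattern
peen = re.escape(prefix) + ".?" + re.escape(suffix)
regex = r"([0-9]+)\s*(?:" + peen + ")"

# Hypothetical sample covering the three spellings seen in the data
text = ("179 Peen Waspeen 14x1kg 200-400 Generica BE Klasse I "
        "222 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I "
        "270 Peen Waspeen 14x1ikg 200-400 Generica BE Klasse I")

found = re.findall(regex, text)
print(found)  # ['179', '222', '270']
```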

Answered By: Matthias

My previous post might be helpful for your needs; take a look at it: [REGEX]

Answered By: Jeevan ebi