Using regex to extract based on a recurring pattern excluding newline characters

Question:

I have a string as follows:

27223525
 
West Food Group B.V.9
 
52608670
 
Westcon
 
Group European Operations Netherlands Branch
 
30221053
 
Westland Infra Netbeheer B.V.
 
27176688
 
Wetransfer  85 B.V.
 
34380998
 
WETRAVEL B.V.
 
70669783

This string contains many newline characters. I want to explicitly ignore these, as well as all runs of 6 or more digits. I came up with the following regex:

[^\n\d{6,}].+

This almost gets me there, since it returns all the company names. However, when a company name itself contains a newline character, its parts are returned as separate company names. For instance, Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression so that the final match is Westcon European Operations Netherlands Branch. What regex concepts should I use to achieve this?

Edit

I tried the following based on the comment below but got the wrong result

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)
Asked By: user32882


Answers:

This will create one group for lines that don’t have numbers.

regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g

Demo: https://regex101.com/r/MMLGw6/1

Answered By: Alex G

I think that you only want the company names. If so, this should work.

import re
from pprint import pprint

input = '''27223525

West Food Group B.V.9

52608670

Westcon

Group European Operations Netherlands Branch

30221053

Westland Infra Netbeheer B.V.

27176688

Wetransfer 85 B.V.

34380998

WETRAVEL B.V.

70669783

'''

company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)

pprint(company_name_regex)

['West Food Group B.V.9',
 'Westcon',
 'Group European Operations Netherlands Branch',
 'Westland Infra Netbeheer B.V.',
 'Wetransfer 85 B.V.',
 'WETRAVEL B.V.']
Answered By: Life is complex

If you can solve this without regex it should be solved without regex:

useful = []

# splitlines() splits on line breaks; split() would split on every space
for line in text.splitlines():
    if line.strip() and not line.isdigit():
        useful.append(line)

That should work – more or less. Replying from my phone so can’t test.
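
A minimal sketch extending this idea so that a name spread over several lines is joined into one entry. It assumes that every line consisting only of 6 or more digits is an ID that ends the name before it; the function name extract_company_names and the shortened sample string are made up for illustration.

```python
def extract_company_names(text):
    """Join the non-numeric lines found between ID lines into one name each."""
    names = []
    parts = []  # pieces of the name currently being assembled
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank or whitespace-only lines
        if line.isdigit() and len(line) >= 6:
            # An ID line closes whatever name was being collected.
            if parts:
                names.append(" ".join(parts))
                parts = []
        else:
            parts.append(line)
    if parts:  # a trailing name with no ID line after it
        names.append(" ".join(parts))
    return names


# Shortened version of the sample from the question (assumed layout).
sample = (
    "27223525\n \nWest Food Group B.V.9\n \n52608670\n \nWestcon\n \n"
    "Group European Operations Netherlands Branch\n \n30221053\n \n"
    "Westland Infra Netbeheer B.V.\n \n27176688\n"
)
print(extract_company_names(sample))
```

Here Westcon and Group European Operations Netherlands Branch end up in a single entry, because no ID line separates them.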

Answered By: Hugo

Assuming your company names start with a letter, you may use this regex with the re.M modifier:

^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)

RegEx Demo

In python:

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

This matches a line that starts with [a-zA-Z] up to the end of that line, and then matches any further lines, separated by \n, that also start with an [a-zA-Z] character.

(?=\n+\d{6,}$) is a lookahead assertion that makes sure the company name is followed by a newline and 6+ digits.
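
A hedged usage sketch of this approach. If the separator lines contain a space, as they appear to in the question's sample, \n+ alone cannot cross them, so this variant swaps \n+ for \s+ (an assumption, not the pattern given above). The names NAME_RE and company_names and the shortened sample are made up for illustration.

```python
import re

# Variant of the pattern above: \s+ instead of \n+ so that separator
# lines holding only a space are tolerated (assumption about the input).
NAME_RE = re.compile(r"^[a-zA-Z].*(?:\s+[a-zA-Z].*)*(?=\s+\d{6,}$)", re.M)


def company_names(text):
    """Return each matched name with internal line breaks collapsed to one space."""
    return [" ".join(m.group(0).split()) for m in NAME_RE.finditer(text)]


# Shortened version of the sample from the question (assumed layout).
sample = (
    "27223525\n \nWest Food Group B.V.9\n \n52608670\n \nWestcon\n \n"
    "Group European Operations Netherlands Branch\n \n30221053\n \n"
    "Westland Infra Netbeheer B.V.\n \n27176688\n"
)
print(company_names(sample))
```

Each raw match still contains the embedded line breaks, so the " ".join(...split()) step is what turns Westcon plus its continuation line into a single name. A name that continues on a line not starting with a letter, such as a lone "-", would not be joined by this pattern.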

Answered By: anubhava