Using regex to extract based on a recurring pattern excluding newline characters
Question:
I have a string as follows:
27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
This string contains many newline characters. I want to explicitly ignore these, as well as all numbers with 6 or more digits. I came up with the following regex:
[^\n\d{6,}].+
This almost gets me there, as it returns all the company names. However, when a company name itself contains a newline character, it is returned as two different company names. For instance, Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression so that the final match is Westcon European Operations Netherlands Branch. What regex concepts should I use to achieve this?
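A minimal reproduction of this behaviour is sketched below. The variable name `sample` is introduced here for illustration and holds the string shown above. Note that inside square brackets `\d{6,}` is not read as "six or more digits": the class simply excludes newlines, single digits, and the literal characters `{`, `,` and `}`.

```python
import re

# The sample string from the question (variable name is illustrative).
sample = (
    "27223525\n"
    "West Food Group B.V.9\n"
    "52608670\n"
    "Westcon\n"
    "Group European Operations Netherlands Branch\n"
    "30221053\n"
    "Westland Infra Netbeheer B.V.\n"
    "27176688\n"
    "Wetransfer 85 B.V.\n"
    "34380998\n"
    "WETRAVEL B.V.\n"
    "70669783\n"
)

# One non-newline, non-digit character, then the rest of that line.
# "." never crosses a newline, so every line is a separate match.
matches = re.findall(r"[^\n\d{6,}].+", sample)

for m in matches:
    print(repr(m))
```

Here 'Westcon' and 'Group European Operations Netherlands Branch' come back as two separate list entries, which is the problem described above.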
Edit
I tried the following based on the comment below but got the wrong result
text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)
Answers:
This will create one group for lines that don’t have numbers.
regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
I think that you only want the company names. If so, this should work:
import re
from pprint import pprint

input = '''27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
'''
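# The first alternative already matches any line from its first letter onward,
# so the second alternative is never used.
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)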
pprint(company_name_regex)
['West Food Group B.V.9',
'Westcon',
'Group European Operations Netherlands Branch',
'Westland Infra Netbeheer B.V.',
'Wetransfer 85 B.V.',
'WETRAVEL B.V.']
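If you can solve this without regex, it should be solved without regex:
useful = []
# splitlines() keeps each name intact; split() would also break on the spaces inside names
for line in text.splitlines():
    # skip blank lines and lines that consist only of digits
    if line.strip() and not line.isdigit():
        useful.append(line)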
That should work – more or less. Replying from my phone so can’t test.
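The loop above keeps every name line as its own entry. A rough sketch of how it could be extended to merge consecutive name lines into one company name follows. It assumes every all-digit line is a registration number that ends the preceding name; the function name `extract_names` and the variable `sample` are made up for this illustration.

```python
def extract_names(text):
    """Group consecutive non-numeric lines into a single company name."""
    names = []
    current = []  # lines of the name currently being collected
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank or whitespace-only lines
        if line.isdigit():
            # An all-digit line closes the name collected so far.
            if current:
                names.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:  # a trailing name that is not followed by a number
        names.append(" ".join(current))
    return names


sample = (
    "27223525\nWest Food Group B.V.9\n52608670\nWestcon\n"
    "Group European Operations Netherlands Branch\n30221053\n"
    "Westland Infra Netbeheer B.V.\n27176688\nWetransfer 85 B.V.\n"
    "34380998\nWETRAVEL B.V.\n70669783\n"
)
names = extract_names(sample)
print(names)
```

Joining with a single space is a choice made here; the original newline could be kept instead by joining with "\n".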
Assuming your company names start with a letter, you may use this regex with the re.M modifier:
^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)
In Python:
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)
This matches a line that starts with [a-zA-Z] until the end of the line, and then matches any further lines, separated by \n, that also start with [a-zA-Z] characters.
(?=\n+\d{6,}$) is a lookahead assertion that makes sure our company names have a newline and 6+ digits ahead.
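As a usage sketch, the pattern can be run against the sample from the question as follows. The variable names `sample` and `names` are illustrative, and replacing the embedded newline with a space is an assumption about the desired output rather than part of the answer.

```python
import re

# The sample string from the question (variable name is illustrative).
sample = (
    "27223525\nWest Food Group B.V.9\n52608670\nWestcon\n"
    "Group European Operations Netherlands Branch\n30221053\n"
    "Westland Infra Netbeheer B.V.\n27176688\nWetransfer 85 B.V.\n"
    "34380998\nWETRAVEL B.V.\n70669783\n"
)

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# The pattern has no capturing groups, so findall returns the full matches.
# A multi-line name still contains its newline; collapse it to a space.
names = [re.sub(r"\n+", " ", m) for m in regex.findall(sample)]
print(names)
```

The sample in the Edit separates lines with '\n \n' rather than a bare newline, which `\n+` does not span, so the pattern would need adjusting for that input.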