Extract all matches unless string contains
Question:
I am using the re package's re.findall to extract terms from strings. How can I write a regex that captures these matches unless the string contains a given substring (in this case, the substring "fake")? I attempted this with an anchored lookahead.
Current Output:
import re
for x in ['a man dogs', "fake: too many dogs", 'hi']:
    print(re.findall(r"(man[a-z]?\b|dog)(?!^.*fake)", x, flags=re.IGNORECASE))
## ['man', 'dog']
## ['many', 'dog']
## []
Desired Output
## ['man', 'dog']
## []
## []
I could accomplish this with an if/else, but I was wondering how to solve it with a pure regex.
for x in ['a man dogs', "fake: too many dogs", 'hi']:
    if re.search('fake', x, flags=re.IGNORECASE):
        print([])
    else:
        print(re.findall(r"(man[a-z]?\b|dog)", x, flags=re.IGNORECASE))
## ['man', 'dog']
## []
## []
Answers:
Since re does not support variable-length lookbehind patterns, a solution based purely on lookarounds is not possible with re. However, the PyPI regex library supports such lookbehind patterns.
After installing the PyPI regex package (pip install regex), you can use
(?<!fake.*)(man[a-z]?\b|dog)(?!.*fake)
See the regex demo.
Details:
(?<!fake.*) - a negative lookbehind that fails the match if, to the left of the current location, there is a fake string followed by zero or more characters other than line break characters, as many as possible
(man[a-z]?\b|dog) - either man, an optional ASCII letter and a word boundary, or the string dog
(?!.*fake) - a negative lookahead that fails the match if, to the right of the current location, there are zero or more characters other than line break characters, as many as possible, followed by a fake string.
In Python:
import regex
for x in ['a man dogs', "fake: too many dogs", 'hi']:
    print(regex.findall(r"(?<!fake.*)(man[a-z]?\b|dog)(?!.*fake)", x, flags=regex.IGNORECASE))
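The limitation of re mentioned above can be checked with the standard library alone. The following is a minimal sketch; the helper name compiles_with_re and the second, fixed-width pattern are illustrative assumptions, not part of the original answer.

```python
import re

def compiles_with_re(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re raises an error for lookbehinds whose width is not fixed
        return False
    return True

# The pattern from the answer: the lookbehind contains .* (variable width)
VARIABLE_LOOKBEHIND = r"(?<!fake.*)(man[a-z]?\b|dog)(?!.*fake)"
# A fixed-width lookbehind (exactly four characters), for comparison only
FIXED_LOOKBEHIND = r"(?<!fake)(man[a-z]?\b|dog)"

print(compiles_with_re(VARIABLE_LOOKBEHIND))
print(compiles_with_re(FIXED_LOOKBEHIND))
```

The fixed-width variant compiles but only rejects matches that come directly after fake, so it is not a substitute for the regex-module pattern.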
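In your pattern (man[a-z]?\b|dog)(?!^.*fake) the negative lookahead comes after the match, but the word fake can still occur before one of the matches. In addition, without re.MULTILINE the ^ inside the lookahead can only match at the start of the string, so after a non-empty match the lookahead never finds anything and the negative assertion always succeeds.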
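A small sketch of this behaviour, using only re. The second sample string "many dogs, fake" is an assumption added for illustration; it is not one of the inputs in the question.

```python
import re

FLAGS = re.IGNORECASE
text = "fake: too many dogs"

# The lookahead with ^ has no effect here: the results equal those of the bare pattern.
with_lookahead = re.findall(r"(man[a-z]?\b|dog)(?!^.*fake)", text, flags=FLAGS)
without_lookahead = re.findall(r"(man[a-z]?\b|dog)", text, flags=FLAGS)
print(with_lookahead == without_lookahead)

# Dropping ^ makes the lookahead work, but only for "fake" AFTER a match.
fake_after = re.findall(r"(man[a-z]?\b|dog)(?!.*fake)", "many dogs, fake", flags=FLAGS)
fake_before = re.findall(r"(man[a-z]?\b|dog)(?!.*fake)", text, flags=FLAGS)
print(fake_after)
print(fake_before)
```

So a lookahead alone covers fake to the right of a match, and something else is needed for fake to the left.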
With Python re you can match what you don't want to keep, to get it out of the way, and capture what you do want to keep in a capture group.
Instead of using a negative lookahead, you could match a whole line that contains the word fake, and otherwise capture the terms:
^.*\bfake\b.*|(man[a-z]?\b|dog)
Explanation
^.*\bfake\b.* - match a whole line that contains the word fake
| - or
(man[a-z]?\b|dog) - capture group 1: match either man followed by an optional char a-z and a word boundary, or match dog
import re
pattern = r"^.*\bfake\b.*|(man[a-z]?\b|dog)"
for x in ['a man dogs', "fake: too many dogs", 'hi']:
    # findall returns group 1 for each match; a matched "fake" line yields '' and is filtered out
    res = [s for s in re.findall(pattern, x, re.IGNORECASE) if s]
    print(res)
Output
['man', 'dog']
[]
[]
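One difference from the if/else in the question is worth noting: \bfake\b only excludes lines where fake is a whole word, while re.search('fake', ...) also excludes substrings such as fakes. The sketch below wraps the capture-group approach in a helper to show both behaviours; the function name extract_terms, the whole_word parameter and the "fakes: ..." sample are hypothetical, added only for illustration.

```python
import re

# Both patterns consume a line containing "fake" and otherwise capture the terms in group 1.
WORD_PATTERN = re.compile(r"^.*\bfake\b.*|(man[a-z]?\b|dog)", re.IGNORECASE)
SUBSTRING_PATTERN = re.compile(r"^.*fake.*|(man[a-z]?\b|dog)", re.IGNORECASE)

def extract_terms(text, whole_word=True):
    """Return the captured terms, or [] if the line contains fake."""
    pattern = WORD_PATTERN if whole_word else SUBSTRING_PATTERN
    # group(1) is None when the "fake" alternative matched, so those matches are skipped
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

for x in ['a man dogs', "fake: too many dogs", 'hi', "fakes: too many dogs"]:
    print(extract_terms(x), extract_terms(x, whole_word=False))
```

Dropping the word boundaries reproduces the substring check from the question; keeping them treats fakes as a different word.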