How to identify a set of nearly-identical sentences while excluding sentences containing a specified word?
Question:
I am trying to create a regex that will identify sentences that are structured as follows: The sentence begins with "I require", followed by any number of random words; the sentence ends with "to disclose the information." If the sentence contains the word "refuse", the regex rejects the sentence as not fitting the pattern.
When applied to the following sentences, this is what the regex should return:
- I require the creepy caterpillar to disclose the information. –TRUE
- I require the giant bug to refuse to disclose the information. –FALSE
- I require the dusty moth to disclose the information. –TRUE
- You must not refuse. –FALSE
Here’s what I have tried:
I can get 3 out of the 4 example sentences correct by writing ^(?:(?!refuse)(.))+$
I can get 2 out of the 4 example sentences correct by writing I require [\s\w]+ to disclose the information.
I can get 2 of the 4 example sentences correct by writing ^(?:(?!refuse)(I require [\s\w]+ to disclose the information.))$
Edit: This question differs from the one at Regex Multiple Conditions because that question deals with two relatively simple truth conditions; this question involves a complex truth condition in the form of a sentence with variables in the middle of it. The answer at Regex Multiple Conditions, however, could be considered a duplicate because it also contains the piece of information I was missing, which is that the negative lookahead needed a wildcard.
Answers:
You can use
r'^(?!.*\brefuse\b)I require\b[\w\s]*\bto disclose the information\.$'
This regular expression can be broken down as follows.
^                                   Match beginning of string
(?!                                 Begin negative lookahead
  .*                                Match zero or more characters other
                                    than line terminators
  \brefuse\b                        Match literal surrounded by word boundaries
)                                   End negative lookahead
I require\b                         Match literal followed by word boundary
[\w\s]*                             Match zero or more word characters or
                                    whitespace characters
\bto disclose the information\.     Match literal preceded by word boundary
$                                   Match end of string
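As a quick check, here is a minimal Python sketch that runs the pattern above against the four example sentences from the question. The names `PATTERN`, `NO_WILDCARD`, and `fits` are my own, chosen for illustration; `NO_WILDCARD` is a simplified stand-in for the asker's third attempt, included only to show what changes when the lookahead has no leading `.*`.

```python
import re

# The pattern from this answer, with the backslashes written out.
PATTERN = re.compile(
    r'^(?!.*\brefuse\b)I require\b[\w\s]*\bto disclose the information\.$'
)

# Illustrative variant (not from the answer): the lookahead has no ".*",
# so it only checks whether the string *starts* with "refuse".
NO_WILDCARD = re.compile(
    r'^(?!refuse)I require [\s\w]+ to disclose the information\.$'
)


def fits(sentence: str) -> bool:
    """Return True if the sentence has the required shape and lacks 'refuse'."""
    return PATTERN.search(sentence) is not None


sentences = [
    "I require the creepy caterpillar to disclose the information.",
    "I require the giant bug to refuse to disclose the information.",
    "I require the dusty moth to disclose the information.",
    "You must not refuse.",
]

for s in sentences:
    print(fits(s), "-", s)

# Without the wildcard, the second sentence is accepted even though it
# contains "refuse", because the lookahead is only tested at position 0.
print(NO_WILDCARD.search(sentences[1]) is not None)
```

The loop prints True, False, True, False, matching the expected results in the question, and the final line prints True, which is the wrong answer for that sentence. Note that `\brefuse\b` only rejects the exact word "refuse"; inflected forms such as "refused" or "refuses" are not caught, so drop the trailing `\b` if those should be rejected too.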