How to identify a set of nearly-identical sentences while excluding sentences containing a specified word?

Question:

I am trying to create a regex that identifies sentences structured as follows: the sentence begins with "I require", followed by any number of arbitrary words, and ends with "to disclose the information." If the sentence contains the word "refuse", the regex should reject it as not fitting the pattern.

When applied to the following sentences, the regex should return these results:

  • I require the creepy caterpillar to disclose the information. –TRUE
  • I require the giant bug to refuse to disclose the information. –FALSE
  • I require the dusty moth to disclose the information. –TRUE
  • You must not refuse. –FALSE

Here’s what I have tried:

I can get 3 out of the 4 example sentences correct by writing ^(?:(?!refuse)(.))+$

I can get 2 out of the 4 example sentences correct by writing I require [\s\w]+ to disclose the information.

I can get 2 out of the 4 example sentences correct by writing ^(?:(?!refuse)(I require [\s\w]+ to disclose the information.))$

Edit: This question differs from the one at Regex Multiple Conditions because that question deals with two relatively simple truth conditions, whereas this question involves a complex truth condition in the form of a sentence with variables in the middle of it. The answer at Regex Multiple Conditions could, however, be considered a duplicate because it also contains the piece of information I was missing: the negative lookahead needed a wildcard.
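
To illustrate the point about the wildcard, here is a minimal Python sketch. The variable names are made up for illustration, and the trailing period is escaped as an assumption about the intended pattern. Without `.*`, the lookahead only checks the text immediately at the start of the string, so a "refuse" later in the sentence goes unnoticed.

```python
import re

sentence = "I require the giant bug to refuse to disclose the information."

# No wildcard: the lookahead only inspects the text at position 0, which is
# "I require...", so it never sees the "refuse" further along.
no_wildcard = re.compile(
    r'^(?!refuse)I require [\s\w]+ to disclose the information\.$'
)

# With ".*": the lookahead scans the rest of the line for "refuse".
with_wildcard = re.compile(
    r'^(?!.*refuse)I require [\s\w]+ to disclose the information\.$'
)

matches_without = bool(no_wildcard.match(sentence))
matches_with = bool(with_wildcard.match(sentence))

print(matches_without)  # True: the sentence is wrongly accepted
print(matches_with)     # False: the sentence is rejected as intended
```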

Asked By: oymonk

||

Answers:

You can use

r'^(?!.*\brefuse\b)I require\b[\w\s]*\bto disclose the information\.$'


This regular expression can be broken down as follows.

^                                Match beginning of string
(?!                              Begin negative lookahead
  .*                             Match zero or more characters other
                                 than line terminators
  \brefuse\b                     Match literal surrounded by word boundaries
)                                End negative lookahead
I require\b                      Match literal followed by word boundary
[\w\s]*                          Match zero or more word characters or
                                 whitespace characters
\bto disclose the information\.  Match literal preceded by word boundary
$                                Match end of string
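
As a quick check, the expression can be run against the four sentences from the question. This is only a sketch using Python's standard `re` module; the names `PATTERN`, `sentences`, and `results` are chosen here for illustration.

```python
import re

# The expression from this answer, compiled once for reuse.
PATTERN = re.compile(
    r'^(?!.*\brefuse\b)I require\b[\w\s]*\bto disclose the information\.$'
)

sentences = [
    "I require the creepy caterpillar to disclose the information.",
    "I require the giant bug to refuse to disclose the information.",
    "I require the dusty moth to disclose the information.",
    "You must not refuse.",
]

# match() anchors at the start of the string; the pattern's own ^ and $
# ensure the whole sentence has to fit.
results = [bool(PATTERN.match(s)) for s in sentences]

for sentence, result in zip(sentences, results):
    print(result, sentence)
```

This prints True, False, True, False for the four sentences, in line with the results the question asks for.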
Answered By: Cary Swoveland