Regex – grouping the titles into a standardized form

Question

I am completely new to regex and would appreciate it if someone could help me out here. 🙂

I have an input text that consists of headings, each followed by a few lines. I wish to group the headings and the content that comes under each heading into 2 separate arrays (or 2 columns in a dataframe).

Example:

the input text :

Inclusion Criteria for all fruit lovers:

extract this line 2

extract this line 3 as well

Exclusion Criteria for all fruit lovers:

extract this exclusion line 2

extract this exclusion line 3 as well

Inclusion Criteria for apple lovers:

extract this line

extract this line as well

Exclusion Criteria for apple lovers:

extract this line

extract this line as well

the inclusion criteria for both apple and orange lovers

extract this exclusion line 2

extract this exclusion line 3 as well

the exclusion criteria for both apple and orange lovers

extract this exclusion line 2

extract this exclusion line 3 as well

Desired output: all the content that comes under a title containing the keyword "inclusion criteria" should be grouped together under Inclusion Criteria; similarly, all the content that comes under a title containing the keyword "exclusion criteria" should be grouped under Exclusion Criteria.

[Inclusion Criteria :
extract this line 2 extract this line 3 as well
…
…
..
]

[Exclusion Criteria:
extract this exclusion line 2
extract this exclusion line 3 as well
…..
….
..]

Regex I tried forming: Inclusion Criteria\s*(.*?)\s*Exclusion Criteria|Inclusion Criteria\s*(.*)(\n\n).*$

Asked By: Angie

Answer 1

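Not the best solution, but it will do for your case (plain string handling, no regex). It assumes the headings alternate between inclusion and exclusion, starting with inclusion.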

data = '''Inclusion Criteria for all fruit lovers:
extract this line 2
extract this line 3 as well
Exclusion Criteria for all fruit lovers:
extract this exclusion line 2
extract this exclusion line 3 as well
Inclusion Criteria for apple lovers:
extract this line
extract this line as well
Exclusion Criteria for apple lovers:
extract this line
extract this line as well
the inclusion criteria for both apple and orange lovers
extract this exclusion line 2
extract this exclusion line 3 as well
the exclusion criteria for both apple and orange lovers
extract this exclusion line 2
extract this exclusion line 3 as well'''
newline_split = data.split('\n')
space_removal = [i for i in newline_split if i.strip()]
keywords = ['Inclusion Criteria', 'Exclusion Criteria', 'inclusion criteria',
        'exclusion criteria']
get_index_inclusion_exclusion = [idx for idx, line in enumerate(space_removal)
                             if any(j in line for j in keywords)]  # enumerate stays correct even if a heading line repeats
start_index = get_index_inclusion_exclusion[0::2]  # inclusion index
stop_index = get_index_inclusion_exclusion[1::2]  # exclusion index
inclusion_line = []
exclusion_line = []
maxi_len = max(len(start_index), len(stop_index))
for i in range(maxi_len):
   if len(start_index) > len(stop_index):
       try:
           inclusion_text = space_removal[start_index[i] + 1:stop_index[i]]
       except IndexError:
           inclusion_text = space_removal[start_index[i] + 1:]
       for j in inclusion_text:
           inclusion_line.append(j)
       try:
           exclusion_text = space_removal[stop_index[i] + 1:start_index[i + 1]]
           for k in exclusion_text:
               exclusion_line.append(k)
       except IndexError:
           pass
   if len(start_index) < len(stop_index):  # stop_index should not be longer than start_index; if it is, extraction only runs up to the next start index
       try:
           inclusion_text = space_removal[start_index[i] + 1:stop_index[i]]
           for j in inclusion_text:
               inclusion_line.append(j)
       except IndexError:
           pass
       try:
           exclusion_text = space_removal[stop_index[i] + 1:start_index[i + 1]]
           for k in exclusion_text:
               exclusion_line.append(k)
       except IndexError:
           pass
   if len(start_index) == len(stop_index):
       inclusion_text = space_removal[start_index[i] + 1:stop_index[i]]
       for j in inclusion_text:
           inclusion_line.append(j)
       try:
           exclusion_text = space_removal[stop_index[i] + 1:start_index[i + 1]]
       except IndexError:
           exclusion_text = space_removal[stop_index[i] + 1:]
       for k in exclusion_text:
           exclusion_line.append(k)


print(f'Inclusion Criteria :{inclusion_line}')
print(f'Exclusion Criteria :{exclusion_line}')
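
For comparison, the same grouping can be sketched as a single pass over the lines that remembers which kind of heading was seen last. This is only a minimal sketch, not a replacement for the code above: the function name `group_criteria` and the sample text are made up for illustration, and it assumes that any line containing "inclusion criteria" or "exclusion criteria" (in any case) is a heading.

```python
import re

def group_criteria(text):
    """Collect the lines under each heading into 'inclusion' or 'exclusion'."""
    # Assumption: a heading is any line containing one of the two keywords.
    heading = re.compile(r"\b(in|ex)clusion criteria\b", re.IGNORECASE)
    groups = {"inclusion": [], "exclusion": []}
    current = None  # no heading seen yet
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        m = heading.search(line)
        if m:
            # A heading switches the active group and is not stored itself.
            current = "inclusion" if m.group(1).lower() == "in" else "exclusion"
        elif current is not None:
            groups[current].append(line)
    return groups

sample = (
    "Inclusion Criteria for all fruit lovers:\n"
    "keep A\n"
    "Exclusion Criteria for all fruit lovers:\n"
    "drop B\n"
    "the inclusion criteria for both apple and orange lovers\n"
    "keep C\n"
)
result = group_criteria(sample)
print(result["inclusion"])  # ['keep A', 'keep C']
print(result["exclusion"])  # ['drop B']
```

Unlike the index-based version, this does not depend on the headings alternating, but a content line that happens to contain one of the keywords would be treated as a heading.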

Answered By: Ramesh

Answer 2

If you want to use a pattern, you can use 3 capture groups: capture groups 1 and 2 match either "In" or "Ex" before "clusion" to determine which kind of heading it is.

In capture group 3, you can match all lines that belong to that block.

^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)

Explanation

^ Start of the line (the pattern is used with re.MULTILINE)
.*\b Match as much of the line as possible, then backtrack to a word boundary
(?: Non capture group
- ([Ii]n)|([Ee]x) Capture In in group 1, or Ex in group 2
) Close the non capture group
clusion [Cc]riteria\b Match clusion, a space and the word Criteria, followed by a word boundary
.* Match the rest of the line
( Capture group 3
- (?: Non capture group to repeat as a whole
  - \n Match a newline
  - (?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b) Assert that the line does not contain an inclusion or exclusion criteria heading
  - .* Match the whole line
- )* Close and optionally repeat the non capture group
) Close group 3
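
To make the roles of the three groups concrete, here is a small sketch that prints the group values for each match. The two-heading sample text is made up for illustration; only one of group 1 and group 2 is set per match, and group 3 holds the block that follows the heading (including its leading newline).

```python
import re

pattern = re.compile(
    r"^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*"
    r"((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)",
    re.MULTILINE,
)

# Hypothetical sample: one inclusion heading and one exclusion heading.
text = "Inclusion Criteria A:\nline 1\nline 2\nthe exclusion criteria B\nline 3"

found = [m.groups() for m in pattern.finditer(text)]
for groups in found:
    print(groups)
# ('In', None, '\nline 1\nline 2')
# (None, 'ex', '\nline 3')
```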

See a regex demo with the capture group values.

Capturing the lines in 2 different lists for example:

import re
import pprint
pattern = r"^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)"

s = ("Inclusion Criteria for all fruit lovers:\n\n"
            "extract this inclusion line\n\n"
            "extract this inclusion line as well\n\n"
            "Exclusion Criteria for all fruit lovers:\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well\n\n"
            "the inclusion criteria for both apple and orange lovers\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well\n\n"
            "the exclusion criteria for both apple and orange lovers\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well")
matches = re.finditer(pattern, s, re.MULTILINE)

inclusion_criteria = []
exclusion_criteria = []

for match in matches:
    if match.group(1):
        inclusion_criteria.append(match.group(3))
    if match.group(2):
        exclusion_criteria.append(match.group(3))

print("Inclusion Criteria")
pprint.pprint([s.strip() for s in inclusion_criteria if s])
print("Exclusion Criteria")
pprint.pprint([s.strip() for s in exclusion_criteria if s])

Output

Inclusion Criteria
['extract this inclusion line\n\nextract this inclusion line as well',
 'extract this exclusion line 2\n\nextract this exclusion line 3 as well']
Exclusion Criteria
['extract this exclusion line 2\n\nextract this exclusion line 3 as well',
 'extract this exclusion line 2\n\nextract this exclusion line 3 as well']
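
Each list entry above is still one whole block with embedded newlines. If you would rather have one entry per content line, as in the desired output of the question, a possible variation is to split group 3 into lines before collecting it. This is a sketch under that assumption; the helper name `split_criteria` and the sample text are made up for illustration.

```python
import re

pattern = re.compile(
    r"^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*"
    r"((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)",
    re.MULTILINE,
)

def split_criteria(text):
    """Return (inclusion_lines, exclusion_lines) with one entry per content line."""
    inclusion, exclusion = [], []
    for m in pattern.finditer(text):
        # Group 3 is the block under the heading; drop blank lines and padding.
        lines = [ln.strip() for ln in m.group(3).splitlines() if ln.strip()]
        target = inclusion if m.group(1) else exclusion
        target.extend(lines)
    return inclusion, exclusion

text = ("Inclusion Criteria for apple lovers:\n\n"
        "first inclusion line\n\n"
        "second inclusion line\n\n"
        "Exclusion Criteria for apple lovers:\n\n"
        "first exclusion line")
inclusion, exclusion = split_criteria(text)
print(inclusion)  # ['first inclusion line', 'second inclusion line']
print(exclusion)  # ['first exclusion line']
```

The two flat lists can then be passed on as they are, for example as two columns of a dataframe if they are padded to the same length.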

Answered By: The fourth bird
