How to slice text preceding a list of re.findall results?

Question

Text:

some text some text Jack is the CEO. some text some text John DOE is the CEO.

Function to find every occurrence of ‘is the CEO’ in the text:

def get_ceo(text):
   results = re.findall(r"is the CEO", text)
   for i in results:
       range = text[i-15:i]
       print(range)

With get_ceo, I want to extract each findall match plus the 15 characters of text preceding it. The number of characters is arbitrary; I’ll then run NLP entity extraction on the range returned for each match.

Desired output:
['some text Jack is the CEO',' text John DOE is the CEO']

Here is the error I’m getting with the function:

  line 62, in <module>
    print(get_ceo(text))
  line 50, in get_ceo
    range = text[i-15:i]
TypeError: unsupported operand type(s) for -: 'str' and 'int'

Do I need to convert the result of the findall function into a different type or change the approach completely?

Asked By: user16779293

Answer 1

Use finditer instead of findall. findall returns only the matched strings, whereas finditer yields match objects, which expose the position of each match through start() and end().

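import re

def get_ceo(text):
    # finditer yields match objects, which carry the position of each match
    for match in re.finditer(r"is the CEO", text):
        # Clamp at 0 so a match near the start does not produce a negative index
        start = max(0, match.start() - 15)
        print(text[start:match.end()])

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
get_ceo(text)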

Output:

some text Jack is the CEO
 text John DOE is the CEO
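
Since the desired output in the question is a list rather than printed lines, the same idea can be sketched as a function that returns its results. The name ceo_contexts and the width parameter are illustrative assumptions, not part of the original answer.

```python
import re

def ceo_contexts(text, width=15):
    """Return each 'is the CEO' match with up to `width` preceding characters."""
    contexts = []
    for match in re.finditer(r"is the CEO", text):
        # Clamp at 0: a negative start index would count from the end of the string.
        start = max(0, match.start() - width)
        contexts.append(text[start:match.end()])
    return contexts

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
print(ceo_contexts(text))
# ['some text Jack is the CEO', ' text John DOE is the CEO']
```

Returning the list leaves the caller free to pass each snippet on to the entity-extraction step.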

Answered By: Ohad Sharet

Answer 2

Instead of matching ‘is the CEO’ itself, use it as a lookahead and match up to 15 characters before it. Note that each result then contains only the preceding text, not the phrase itself.

import re

def get_ceo(text):
    results = re.findall(r'.{1,15}(?=is the CEO)', text)
    print(results)

prints:

['some text Jack ', ' text John DOE ']
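
If both the preceding text and the phrase are needed, but kept apart (for example, to run entity extraction on the context only), one possible variation is to use two capture groups. This is a sketch, not the answerer's code; the names PATTERN and split_contexts are hypothetical.

```python
import re

# Group 1: up to 15 characters of context (lazy); group 2: the phrase itself.
PATTERN = re.compile(r"(.{0,15}?)(is the CEO)")

def split_contexts(text):
    """Return (context, phrase) tuples, one per match."""
    # With two groups in the pattern, findall returns a list of tuples.
    return PATTERN.findall(text)

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
print(split_contexts(text))
# [('some text Jack ', 'is the CEO'), (' text John DOE ', 'is the CEO')]
```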

Answered By: Barmar

Answer 3

Try a lazy quantifier in front of the phrase, which keeps the phrase in each result:

import re

def get_ceo(text):
    results = re.findall(r'.{0,15}?is the CEO', text)
    print(results)

texts = ['some text some text Jack is the CEO. some text some text John DOE is the CEO.',
         'a is the CEO b is the CEO',
         'is the CEO there?']

for el in texts:
    get_ceo(el)

Prints:

['some text Jack is the CEO', ' text John DOE is the CEO']
['a is the CEO', ' b is the CEO']
['is the CEO']
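
The pattern above can be generalized to any phrase and context width. The sketch below is an assumption-laden variation: find_with_context is a hypothetical helper, re.escape is used so the phrase is treated literally, and re.DOTALL is an added design choice so that the context can span line breaks (by default `.` does not match a newline).

```python
import re

def find_with_context(text, phrase, width=15):
    """Return each occurrence of `phrase` with up to `width` preceding characters."""
    # The lazy quantifier keeps closely spaced occurrences as separate matches.
    pattern = re.compile(f".{{0,{width}}}?{re.escape(phrase)}", re.DOTALL)
    return pattern.findall(text)

print(find_with_context("a is the CEO b is the CEO", "is the CEO"))
# ['a is the CEO', ' b is the CEO']
```

A greedy `.{0,15}` would instead swallow the first occurrence as context for the second and return a single match for that input, which is why the lazy form is kept here.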

Answered By: JvdV
