How to slice text preceding a list of re.findall results?
Question:
Text:
some text some text Jack is the CEO. some text some text John DOE is the CEO.
Function to find all the ‘is the CEO’ in the text.
def get_ceo(text):
    results = re.findall(r"is the CEO", text)
    for i in results:
        range = text[i-15:i]
        print(range)
With get_ceo, I want to extract each result of findall plus the 15 characters of text preceding it. The number of characters is arbitrary; I'll then run an NLP entity extraction on the range returned for each result.
Desired output:
['some text Jack is the CEO',' text John DOE is the CEO']
Here is the error I’m getting with the function:
line 62, in <module>
print(get_ceo(text))
line 50, in get_ceo
range = text[i-15:i]
TypeError: unsupported operand type(s) for -: 'str' and 'int'
Do I need to convert the result of the findall function into a different type, or change the approach completely?
Answers:
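Use finditer instead of findall. findall returns the matched strings themselves, which is why subtracting 15 from a result raises the TypeError. finditer yields match objects, whose start() and end() methods give you the index of the substring you are looking for.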
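import re

def get_ceo(text):
    results = re.finditer(r"is the CEO", text)
    for match in results:
        # Clamp at 0 so a match near the start does not give a negative index
        start = max(0, match.start() - 15)
        snippet = text[start:match.end()]  # avoid shadowing the built-in range
        print(snippet)

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
get_ceo(text)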
output:
some text Jack is the CEO
 text John DOE is the CEO
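The desired output in the question is a list rather than printed lines, so here is a minimal sketch of the same finditer idea that collects the slices and returns them. The function name get_ceo_snippets and the width parameter are made up for illustration. The max(0, ...) clamp is there because a negative start index would count from the end of the string and give the wrong slice when a match sits within the first 15 characters.

```python
import re

def get_ceo_snippets(text, width=15):
    """Return each 'is the CEO' match with up to `width` preceding characters."""
    snippets = []
    for match in re.finditer(r"is the CEO", text):
        # Never let the slice start before the beginning of the string
        start = max(0, match.start() - width)
        snippets.append(text[start:match.end()])
    return snippets

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
print(get_ceo_snippets(text))
# ['some text Jack is the CEO', ' text John DOE is the CEO']
print(get_ceo_snippets("Ann is the CEO"))
# ['Ann is the CEO']
```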
Instead of matching is the CEO itself, use it as a lookahead and match up to 15 characters before it. Note that the results then contain only the preceding text, not the phrase.
import re

def get_ceo(text):
    # The lookahead (?=...) checks for the phrase without including it in the match
    results = re.findall(r'.{1,15}(?=is the CEO)', text)
    print(results)
prints:
['some text Jack ', ' text John DOE ']
Have a try with:
import re

def get_ceo(text):
    results = re.findall(r'.{0,15}?is the CEO', text)
    print(results)

l = ['some text some text Jack is the CEO. some text some text John DOE is the CEO.',
     'a is the CEO b is the CEO',
     'is the CEO there?']
for el in l:
    get_ceo(el)
Prints:
['some text Jack is the CEO', ' text John DOE is the CEO']
['a is the CEO', ' b is the CEO']
['is the CEO']
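If the phrase or the window size needs to vary, the same lazy pattern can be built at runtime. This is only a sketch: build_pattern and its parameters are hypothetical names, re.escape is used so that special characters in the phrase are matched literally, and re.DOTALL is an optional addition (not part of the answer above) that lets the window span line breaks, since . does not match a newline by default.

```python
import re

def build_pattern(phrase, width=15):
    # Lazy {0,width}? keeps a window from swallowing an earlier occurrence
    return re.compile(r".{0,%d}?%s" % (width, re.escape(phrase)), re.DOTALL)

pattern = build_pattern("is the CEO")
print(pattern.findall("a is the CEO b is the CEO"))
# ['a is the CEO', ' b is the CEO']
print(pattern.findall("Jack\nis the CEO"))
# ['Jack\nis the CEO']
```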