How to extract sentences from text that contain keywords in a list
Question:
I’m trying to return all sentences that contain any of the words in a list, but the result only contains the sentence for the second word in the list. In the example below, I wanted to extract the sentences that contain inflation or commodity, not just commodity. Any help would be appreciated.
text = 'inflation is very high. commodity prices are rising a lot. this is an extra sentence'
words = ['inflation', 'commodity']
for word in words:
    [words.casefold() for words in words] #to ignore cases in text
def extract_word(text):
    return [sentence for sentence in text.split('.') if word in sentence]
extract_word(text)
[' commodity prices are rising a lot']
Answers:
The condition `if word in sentence` checks whether the loop variable `word` from the `for` loop is in `sentence`. Since "commodity" is the last element of the list `words`, `word` still holds the string "commodity" after the `for` loop has finished, so only that keyword is ever tested.
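The behaviour described above can be reproduced in isolation. This is only a minimal sketch; the helper name `contains_word` is made up for illustration and is not part of the question's code.

```python
words = ['inflation', 'commodity']

for word in words:
    pass  # the loop body does not matter here

# After the loop ends, `word` is still bound to the last element.
print(word)  # commodity

def contains_word(sentence):
    # `word` is looked up when the function runs, so it sees 'commodity'.
    return word in sentence

print(contains_word('inflation is very high'))             # False
print(contains_word('commodity prices are rising a lot'))  # True
```

Only the last keyword is ever compared, which matches the single-sentence result in the question.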
Instead, inside the list comprehension, you can check whether any of the elements in `words` is in `sentence`, as shown below:
text = 'inflation is very high. commodity prices are rising a lot. this is an extra sentence'
words = ['inflation', 'commodity']
sentences = [
    sentence for sentence in text.split(".") if any(
        w.lower() in sentence.lower() for w in words
    )
]
print(sentences)
# >>> ['inflation is very high', ' commodity prices are rising a lot']
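Note that `in` on strings is a substring test, so a keyword such as "inflation" would also match "disinflation", and the results keep the leading space left over from the split. If whole-word matches are wanted, one possible variation (a sketch, not part of the answer above; the function name `extract_sentences` is hypothetical) uses the standard `re` module with `\b` word boundaries:

```python
import re

def extract_sentences(text, words):
    # With no keywords there is nothing to match.
    if not words:
        return []
    # Build one pattern such as r'\b(?:inflation|commodity)\b'.
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b',
        re.IGNORECASE,
    )
    # Split on '.', trim whitespace, and skip empty pieces.
    sentences = (s.strip() for s in text.split('.'))
    return [s for s in sentences if s and pattern.search(s)]

text = 'Inflation is very high. commodity prices are rising a lot. disinflation is unlikely.'
print(extract_sentences(text, ['inflation', 'commodity']))
# ['Inflation is very high', 'commodity prices are rising a lot']
```

Splitting on `'.'` is still a rough notion of a sentence; abbreviations or decimal numbers would be cut in the wrong place.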
You can try this:
texts = ' commodity prices are rising a lot. some random text. this text contains the word: inflation'
words = ['inflation','commodity']
lst_words = [w.casefold() for w in words]  # to ignore case in the text
def found_word(sentence, lst_words):
    # compare each casefolded token of the sentence against the keyword list
    return any(word.casefold() in lst_words for word in sentence.split())
def extract_word(text):
    lst_sentences = []
    for sentence in text.split('.'):
        if found_word(sentence, lst_words):
            lst_sentences.append([sentence + '.'])
    return lst_sentences
extract_word(texts)
# [[' commodity prices are rising a lot.'],
# [' this text contains the word: inflation.']]
It is a bit longer, but I think it is much easier to read.
I find generators to be pretty handy in these kinds of cases.
def extract_word(text):
    words = ['inflation', 'commodity']
    sentences = text.split('.')
    for sentence in sentences:
        if any(word in sentence for word in words):
            yield sentence
>>> list(extract_word('inflation is very high. commodity prices are rising a lot. this is an extra sentence'))
['inflation is very high', ' commodity prices are rising a lot']
It’s readable and easy to understand what the outcome is.
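One thing a generator adds over a list is that matches are produced on demand. The sketch below is a variation on the generator above, with the keyword list passed in as a parameter instead of being hardcoded; the name `iter_sentences` is made up for illustration.

```python
def iter_sentences(text, words):
    # Yield each '.'-separated piece that contains any keyword.
    for sentence in text.split('.'):
        if any(word in sentence for word in words):
            yield sentence

text = 'inflation is very high. commodity prices are rising a lot. this is an extra sentence'
matches = iter_sentences(text, ['inflation', 'commodity'])

# next() pulls only the first match; later sentences are not examined yet.
print(next(matches))  # inflation is very high
print(list(matches))  # [' commodity prices are rising a lot']
```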