Can stop phrases be removed while doing text processing in Python?

Question:

The task I’m working on involves finding the cosine similarity, using TF-IDF, between a base transcript and other sample transcripts.

I am already removing stop words for this, but I would also like to remove certain stop phrases that are specific to the sample transcripts.

For example, I would like to retain the individual words ‘sounds’ and ‘like’, but remove the phrase ‘sounds like’ when the two occur together.

I am currently using scikit-learn’s TfidfVectorizer. Is there an efficient way to do the above?

Asked By: vp_5614

Answers:

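Yes, you can achieve this by defining a function, custom_preprocessor, that removes the stop phrases, and passing it to the TfidfVectorizer constructor via the preprocessor argument. Note that a custom preprocessor replaces the vectorizer’s default preprocessing step (lowercasing and accent stripping), so do any lowercasing you need inside the function. Tokenization and stop-word removal still run afterwards.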

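from sklearn.feature_extraction.text import TfidfVectorizer

stop_phrases = ['sounds like']  # example; list your own phrases here
def custom_preprocessor(text):
    text = text.lower()  # the default lowercasing is skipped when a preprocessor is given
    for stop_phrase in stop_phrases:
        text = text.replace(stop_phrase, ' ')  # a space keeps neighbouring words separated
    return text

vectorizer = TfidfVectorizer(preprocessor=custom_preprocessor)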
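
One caveat with plain str.replace is that it also matches inside longer words, so ‘sounds like’ would be cut out of ‘sounds likely’. If that matters for your transcripts, a regex with word boundaries is one possible variant. The sketch below is only an illustration: the names STOP_PHRASES and remove_stop_phrases and the example phrases are made up here, and it assumes the phrases are written in lowercase and separated by single spaces in the text.

```python
import re

# Hypothetical example phrases (lowercase); replace with the ones from your transcripts.
STOP_PHRASES = ["sounds like", "you know"]

# Longer phrases go first so they win over shorter phrases they contain.
# The \b anchors stop "sounds like" from matching inside "sounds likely".
_STOP_PHRASE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(STOP_PHRASES, key=len, reverse=True))
    + r")\b"
)


def remove_stop_phrases(text):
    """Lowercase the text and drop whole-phrase matches, keeping single words."""
    text = text.lower()
    text = _STOP_PHRASE_RE.sub(" ", text)
    # Collapse the extra whitespace left behind by removed phrases.
    return " ".join(text.split())


print(remove_stop_phrases("That sounds like a plan"))
print(remove_stop_phrases("It sounds likely, and I like the sounds"))
```

This function can be passed in the same way, for example TfidfVectorizer(preprocessor=remove_stop_phrases, stop_words='english'), so ordinary stop-word removal still happens during tokenization.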
Answered By: magedo