remove extra words from text

Question

I've been trying to remove extra words like {'by', 'the', 'and', 'of', 'a'}
from a text. The best way I have found so far is this:

Code:

import re

def clean_text(text):
    """
    Takes the text and removes punctuation and some common words.
    """
    stopwords = {'by', 'the', 'and', 'of', 'a'}
    # Split on runs of non-word characters, then drop the stop words
    result = [word for word in re.split(r"\W+", text) if word.lower() not in stopwords]
    result = ' '.join(result)
    print(result)
    return result

#dummy text
long_string = "one Groups are marked by the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there"
clean_text(long_string)

My question: is there a better way to do this without a for loop? Does the re module have a method that removes a set of words from a text, so that I can avoid the loop?

Asked By: ImThePeak

Answer 1

You can use this regex pattern to remove short words from your text. It checks the length of each word and removes the word if it is between 1 and 3 characters long.

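import re
# Match optional non-word characters followed by a whole word of 1 to 3 characters
shortword = re.compile(r'\W*\b\w{1,3}\b')
shortword.sub('', anytext)  # anytext is the string to clean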
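
As a quick check of what this pattern does, the sketch below applies it to a short sample sentence (the sample text and the name `sample` are made up for illustration). Note that it drops every word of one to three characters, such as "are", and not only the stop words listed in the question.

```python
import re

# Optional non-word characters followed by a whole word of 1 to 3 characters.
shortword = re.compile(r'\W*\b\w{1,3}\b')

sample = "Groups are marked by the meta-characters"  # hypothetical sample text
cleaned = shortword.sub('', sample)
print(cleaned)  # Groups marked meta-characters
```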

Answered By: Mohammadreza Nazif

Answer 2

You could use a regex replacement approach by forming an alternation of stop words and then removing them.

import re

long_string = "one Groups are marked by the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there"
words = ["by", "the", "and", "of", "a"]
# Build \s*\b(?:by|the|and|of|a)\b\s* so only whole words are matched
regex = r'\s*\b(?:' + r'|'.join(words) + r')\b\s*'
output = re.sub(regex, ' ', long_string).strip()
print(output)

This prints:

one Groups are marked ()meta-characters. two They group together expressions contained one inside them, you can one repeat contents group with repeating qualifier, such as there
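
A possible variation on the same idea, offered only as a sketch: the helper name `remove_stopwords` and the sample sentence are hypothetical, not part of the answer above. It escapes each word with `re.escape`, ignores case so that "The" is removed as well as "the", and collapses leftover whitespace in a second pass, since adjacent stop words can leave more than one space behind.

```python
import re

def remove_stopwords(text, words):
    # Escape each word so any regex metacharacters in it are matched literally.
    alternation = '|'.join(re.escape(w) for w in words)
    pattern = re.compile(r'\b(?:' + alternation + r')\b', flags=re.IGNORECASE)
    without_words = pattern.sub(' ', text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', without_words).strip()

stop_words = ["by", "the", "and", "of", "a"]
print(remove_stopwords("The contents of a group, and The rest", stop_words))
# contents group, rest
```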

Answered By: Tim Biegeleisen
