Remove small words using Python

Question:

Is it possible to use a regex to remove small words from a text? For example, I have the following string (text):

anytext = " in the echo chamber from Ontario duo "

I would like to remove all words that are 3 characters or fewer. The result should be:

"echo chamber from Ontario"

Is it possible to do that using a regular expression or any other Python function?

Thanks.

Asked By: Thomas

Answers:

I don’t think you need a regex for this simple example anyway …

' '.join(word for word in anytext.split() if len(word)>3)
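
As a minimal sketch, the one-liner above can be wrapped in a reusable function. The name remove_short_words and the max_len parameter are illustrative assumptions, not part of the original answer.

```python
def remove_short_words(text, max_len=3):
    """Drop whitespace-separated tokens of max_len characters or fewer."""
    # split() with no argument splits on any run of whitespace and
    # ignores leading/trailing whitespace, so no strip() is needed.
    return ' '.join(word for word in text.split() if len(word) > max_len)


anytext = " in the echo chamber from Ontario duo "
print(remove_short_words(anytext))  # echo chamber from Ontario
```

Because split() only separates on whitespace, punctuation attached to a word counts toward its length here.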
Answered By: mgilson

Certainly, it’s not that hard either:

import re
shortword = re.compile(r'\W*\b\w{1,3}\b')

The above expression selects any word that is preceded by some non-word characters (essentially whitespace or the start), is between 1 and 3 characters short, and ends on a word boundary.

>>> shortword.sub('', anytext)
' echo chamber from Ontario '

The \b boundary matches are important here; they ensure that you don’t match just the first or last 3 characters of a word.
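
To illustrate why the word boundaries matter, here is a small sketch comparing the pattern with and without them. The variable names and the sample text are made up for this example.

```python
import re

text = "echo chamber"

# Without \b the pattern is free to match 1-3 character chunks inside
# longer words, so it chews through the whole string piece by piece.
no_boundaries = re.compile(r'\W*\w{1,3}')

# With \b on both sides, only complete words of 1-3 characters match.
with_boundaries = re.compile(r'\W*\b\w{1,3}\b')

print(repr(no_boundaries.sub('', text)))    # ''
print(repr(with_boundaries.sub('', text)))  # 'echo chamber'
```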

The \W* at the start lets you remove both the word and the preceding non-word characters so that the rest of the sentence still matches up. Note that punctuation is included in \W; use \s if you only want to remove preceding whitespace.
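
The difference between removing preceding non-word characters and removing only preceding whitespace can be seen in a short sketch. The sample sentence and the names with_nonword and with_space are assumptions chosen for illustration.

```python
import re

sample = "echo, in the chamber"

# \W* also swallows punctuation that precedes a short word.
with_nonword = re.compile(r'\W*\b\w{1,3}\b')

# \s* removes only the whitespace in front of a short word.
with_space = re.compile(r'\s*\b\w{1,3}\b')

print(repr(with_nonword.sub('', sample)))  # 'echo chamber'
print(repr(with_space.sub('', sample)))    # 'echo, chamber'
```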

For what it’s worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson’s version collapses multiple whitespace characters into one space. Not sure if that matters to you.
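
A quick sketch of that whitespace difference, using an invented input with runs of several spaces:

```python
import re

shortword = re.compile(r'\W*\b\w{1,3}\b')
spaced = "echo   chamber in  the   Ontario"

# The regex only deletes short words plus what precedes them, so the
# spacing between the surviving words is left untouched.
regex_result = shortword.sub('', spaced)

# split() discards all whitespace, and join() puts back single spaces.
split_result = ' '.join(w for w in spaced.split() if len(w) > 3)

print(repr(regex_result))  # 'echo   chamber   Ontario'
print(repr(split_result))  # 'echo chamber Ontario'
```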

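His solution (strictly a generator expression rather than a list comprehension) is slightly faster of the two in this measurement, with timeit's default of one million runs each: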

>>> import timeit
>>> def re_remove(text): return shortword.sub('', text)
... 
>>> def lc_remove(text): return ' '.join(word for word in text.split() if len(word)>3)
... 
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import re_remove as remove')
7.0774190425872803
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import lc_remove as remove')
6.4250049591064453
Answered By: Martijn Pieters

If you have a string of words, put it in the str1 variable.

If you already have a list, put it in the list_1 variable and delete the code above that variable.

# Convert a string into a list of words
def Convert(string):
    li = string.split(" ")
    return li

str1 = "Put list of strings to convert into a list here"
list_1 = Convert(str1)

# Join a list of words back into a single string
def listToString(s):
    str2 = " "
    return str2.join(s)

anytext = listToString(list_1)

# The number in the comparison is the length threshold: words with that
# many characters or fewer are removed (use 3 for the question's example)
print(' '.join(word for word in anytext.split() if len(word) > 1))
Answered By: AG3012

The best way to do this is simply:

re.findall(r'\b\w+\w{3,}\b', 'in the echo chamber from Ontario duo')

The result is what you want, but note that this gives you a list, not a string.
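
If a string is needed, the list can be joined back together. The sketch below is one way to do that; it uses \w{4,}, which is assumed here as an equivalent, shorter way of writing "one or more word characters followed by at least three more", and the variable names are illustrative.

```python
import re

text = 'in the echo chamber from Ontario duo'

# Collect every whole word of four or more word characters.
long_words = re.findall(r'\b\w{4,}\b', text)
print(long_words)            # ['echo', 'chamber', 'from', 'Ontario']

# Join the surviving words with single spaces to get a string back.
print(' '.join(long_words))  # echo chamber from Ontario
```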

Answered By: mohamed ali