Remove adjacent duplicate words in a string with Python?

Question:

How would I remove adjacent duplicate words in a string? For example:
‘Hey there There’ -> ‘Hey there’

Asked By: user1655130


Answers:

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a space
\1     then followed by the same word (ignoring case)

Then, we just replace the whole match with the first of the two adjacent words.
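To see what the backreference actually captures, the match can be inspected with re.search before substituting. This is only a small illustrative sketch; the variable name match is not from the original answer.

```python
import re

# Find the first place where a word is followed by a space and the same word again.
match = re.search(r'(\w+) \1', 'Hey there There', flags=re.IGNORECASE)

print(match.group(0))  # the full matched text: there There
print(match.group(1))  # what group 1 captured: there
```

re.sub replaces the full match (group 0) with the text of group 1, which is why only the first of the two words survives.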

Answered By: Tim Biegeleisen

import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there eating?

\b matches a word boundary, so the pattern only matches whole words rather than fragments of words. The second test case ("Hey there eating?") does not work with the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676): without the boundaries, the trailing "e" of "there" and the leading "e" of "eating" are treated as a duplicate, giving "Hey thereating?".
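A side-by-side comparison makes the difference easier to see. This is a minimal sketch; the names LOOSE and STRICT are only illustrative labels for the two patterns discussed so far.

```python
import re

# Pattern without word boundaries (first answer).
LOOSE = re.compile(r'(\w+) \1', flags=re.IGNORECASE)
# Pattern with word boundaries on both sides.
STRICT = re.compile(r'\b(\w+) \1\b', flags=re.IGNORECASE)

for text in ('Hey there There', 'Hey there eating?'):
    loose_result = LOOSE.sub(r'\1', text)
    strict_result = STRICT.sub(r'\1', text)
    print(f'{text!r}: loose={loose_result!r}, strict={strict_result!r}')
```

Both patterns agree on the first input; only the boundary-aware one leaves the second input untouched.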

Remove adjacent duplicate words recursively

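    def removeConsecutiveDuplicateWords(s):
        # Split on whitespace; with fewer than two words there is nothing to compare.
        st = s.split()
        if len(st) < 2:
            return " ".join(st)
        # Keep the first word if it differs from the next one (case-sensitive comparison).
        if st[0] != st[1]:
            return st[0] + " " + removeConsecutiveDuplicateWords(" ".join(st[1:]))
        # Otherwise drop the first word and continue with the rest.
        return removeConsecutiveDuplicateWords(" ".join(st[1:]))


    string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
    print(removeConsecutiveDuplicateWords(string))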

Output: I am a duplicate word in a sentence. How I can be removed?
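The recursive version goes one call deeper per word, so a very long input can hit Python's recursion limit. As a possible alternative (not part of the original answer), the same idea can be sketched iteratively with itertools.groupby. The function name dedupe_adjacent_words is hypothetical, and unlike the recursive version this sketch compares words case-insensitively and keeps the first word of each run.

```python
from itertools import groupby


def dedupe_adjacent_words(text: str) -> str:
    """Collapse runs of identical adjacent words, keeping the first of each run."""
    words = text.split()
    # groupby puts consecutive items with the same key into one group;
    # str.casefold as the key makes 'there' and 'There' compare equal.
    return " ".join(next(group) for _, group in groupby(words, key=str.casefold))


print(dedupe_adjacent_words('I am a duplicate duplicate word in a sentence. How I can be be be removed?'))
print(dedupe_adjacent_words('Hey there There'))
```

Like the recursive version, this splits on whitespace only, so punctuation stays attached to its word and "be be." would not be collapsed.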

Answered By: Farhad Kabir

Rohit Sharma’s answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change “Hey there eating” to “Hey thereating”.

Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):

my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)

Example 1:

INPUT: Buying food food in the supermarket

ROHIT'S VERSION OUTPUT: Buying food in the supermarket

ABOVE VERSION OUTPUT: Buying food in the supermarket

Example 2:

INPUT: Food: Food and Beverages

ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)

ABOVE VERSION OUTPUT: Food and Beverages
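The two examples can be reproduced with a short script. This is a sketch only; the function names dedupe_space_only and dedupe_any_separator are made up here to label the two regexes being compared.

```python
import re


def dedupe_space_only(text):
    # Duplicates must be separated by exactly one space.
    return re.sub(r'\b(\w+) \1\b', r'\1', text, flags=re.IGNORECASE)


def dedupe_any_separator(text):
    # Duplicates may be separated by any run of non-word characters.
    return re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', text, flags=re.IGNORECASE)


for sample in ('Buying food food in the supermarket', 'Food: Food and Beverages'):
    print(dedupe_space_only(sample), '|', dedupe_any_separator(sample))
```

Note that the second regex also drops the separating punctuation (the colon in Example 2), which may or may not be what you want.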

Explanation:

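“\b”: A word boundary. Boundaries are needed for special cases. For example, in “My thesis is great”, the “is” at the end of “thesis” won’t be treated as a duplicate of the following “is”.

“\w+”: One or more word characters. With re.ASCII, \w is [a-zA-Z0-9_]; by default, Python 3 string patterns also match Unicode letters and digits.

“\W+”: One or more non-word characters: [^\w]

“\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)

“+”: Match whatever it’s placed after 1 or more times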
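The same pieces can be written as a commented pattern using re.VERBOSE, which ignores whitespace and # comments inside the regex. This is a sketch of the regex from this answer; the constant name ADJACENT_DUPLICATES is just an illustrative choice.

```python
import re

ADJACENT_DUPLICATES = re.compile(
    r"""
    \b          # word boundary: start at the beginning of a word
    (\w+)       # group 1: one or more word characters (the word itself)
    (?:         # non-capturing group for each repeated occurrence
        \W+     #   one or more non-word characters (spaces, punctuation)
        \1      #   the same text that group 1 captured
        \b      #   word boundary: the repeat must end where the word ends
    )+          # one or more repeats, so runs of three or more collapse too
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

print(ADJACENT_DUPLICATES.sub(r"\1", "How can I be be be removed?"))
print(ADJACENT_DUPLICATES.sub(r"\1", "My thesis is great"))
```

The first call collapses the run of three words into one; the second leaves the sentence unchanged because of the word boundaries.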

Credits:

I adapted this code to Python but it originates from this geeksforgeeks.org post

Answered By: H3lix