Remove duplicate words in the same cell within a column in Python

Question:

I need some help. I have a column of words, and I want to remove the duplicated words inside each cell.

What I want to get is something like this:

words               expected
car apple car good  car apple good
good bad well good  good bad well
car apple bus food  car apple bus food

I've tried this, but it is not working:

from collections import OrderedDict


df['expected'] = (df['words'].str.split().apply(lambda x: OrderedDict.fromkeys(x).keys()).str.join(' '))

I’ll be very grateful if somebody can help me

Asked By: Sebastian R


Answers:

If each cell is a string such as "word1 word2":

df['expected'] = [" ".join(set(wrds.strip().split())) for wrds in df.words] 
Answered By: dermen

If you don’t need to retain the original order of the words, you can create an intermediate set which will remove duplicates.

df["expected"] = df["words"].str.split().apply(set).str.join(" ")
Answered By: tdelaney
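
To illustrate the trade-off of the set-based approach, here is a minimal sketch in plain Python using the rows from the question. The helper name dedupe_unordered is hypothetical and not part of pandas. A set removes repeats but does not promise any particular word order, so the sketch sorts the words alphabetically purely to make the result reproducible.

```python
def dedupe_unordered(text):
    """Remove repeated words; the original word order is NOT preserved."""
    # set() keeps one copy of each word; sorted() only makes the result deterministic.
    return " ".join(sorted(set(text.split())))


# Rows taken from the question.
rows = ["car apple car good", "good bad well good", "car apple bus food"]
result = [dedupe_unordered(r) for r in rows]
print(result)
```

With a DataFrame, the same helper could be applied per cell, for example df["words"].map(dedupe_unordered).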

If order is important, use dict.fromkeys in a list comprehension:

df['expected'] = [' '.join(dict.fromkeys(w.split())) for w in df['words']]

Output:

                words            expected
0  car apple car good      car apple good
1  good bad well good       good bad well
2  car apple bus food  car apple bus food
Answered By: mozway
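
The order-preserving approach can also be wrapped in a small function, which may help if the column contains missing values, since calling .split() on a non-string cell would raise an error. The following is only a sketch: the name dedupe_ordered is hypothetical, and passing non-string cells through unchanged is an assumption about the desired behaviour, not something stated in the question.

```python
def dedupe_ordered(cell):
    """Drop repeated words from one cell, keeping the first occurrence of each."""
    if not isinstance(cell, str):
        # Assumption: non-string cells (e.g. None or NaN) are returned unchanged.
        return cell
    # dict keys are unique and keep insertion order (Python 3.7+).
    return " ".join(dict.fromkeys(cell.split()))


cells = ["car apple car good", "good bad well good", None]
cleaned = [dedupe_ordered(c) for c in cells]
print(cleaned)
```

On a DataFrame this could be used as df["words"].map(dedupe_ordered).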