Remove words inside tuples that appear only once across the dataset

Question:

I have a dataset with many rows (>1000), each containing a tuple of words. I want to remove the words inside each tuple that appear only once across all rows. Here is an example of the data:

        before_cleaning    after_cleaning
0                [cool]            [cool]
1            [gooooood]                []
2  [we, love, it, cool]  [love, it, cool]
3            [love, it]        [love, it]

Column before_cleaning is the initial data, and column after_cleaning is what I expect the data to look like after the removal. As you can see in the example, "gooooood" and "we" are removed because each of them appears only once across rows 0 to 3.

Asked By: Christabel


Answers:

You can use apply with a lambda function: for each row's list, keep only the words whose count across all rows is greater than 1.

Code:

all_words = sum(list(df['before_cleaning']), [])  # flatten all rows into one list, once
df['after_cleaning'] = df['before_cleaning'].apply(lambda row: [i for i in row if all_words.count(i) > 1])
Answered By: R. Baraiya

Use collections.Counter and itertools.chain to count the words across all rows, then a set and a list comprehension to filter each row:

from collections import Counter
from itertools import chain

keep = {k for k,v in Counter(chain.from_iterable(df['before_cleaning'])).items()
        if v>1}
# {'cool', 'it', 'love'}

df['after_cleaning'] = [[x for x in l if x in keep]
                        for l in df['before_cleaning']]

Output:

        before_cleaning    after_cleaning
0                [cool]            [cool]
1                [good]                []
2  [we, love, it, cool]  [love, it, cool]
3            [love, it]        [love, it]
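
For anyone who wants to try the idea without a DataFrame, here is a minimal sketch of the same Counter approach on the sample rows from the question. The helper name remove_rare_words and its min_count parameter are made up for illustration; they are not part of pandas or the standard library.

```python
from collections import Counter
from itertools import chain


def remove_rare_words(rows, min_count=2):
    """Hypothetical helper: drop words seen fewer than min_count times overall."""
    rows = list(rows)  # materialise so the data can be walked twice
    counts = Counter(chain.from_iterable(rows))
    keep = {word for word, n in counts.items() if n >= min_count}
    # Rebuild each row with its original container type (list or tuple)
    return [type(row)(word for word in row if word in keep) for row in rows]


before = [['cool'], ['gooooood'], ['we', 'love', 'it', 'cool'], ['love', 'it']]
after = remove_rare_words(before)
print(after)
```

With a DataFrame this could be called as df['after_cleaning'] = remove_rare_words(df['before_cleaning']). Note that Counter counts total occurrences, so a word repeated inside a single row would be kept even if no other row contains it.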

Pandas alternative to create the set:

keep = set(df['before_cleaning'].explode().value_counts().loc[lambda x: x>1].index)
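
A rough end-to-end sketch of the pandas variant, assuming the column holds lists as in the question's sample data (the intermediate name counts is only for readability):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'before_cleaning': [['cool'], ['gooooood'], ['we', 'love', 'it', 'cool'], ['love', 'it']]
})

# One row per word, then count how often each word occurs overall
counts = df['before_cleaning'].explode().value_counts()
keep = set(counts[counts > 1].index)

# Filter each original row against the set of words to keep
df['after_cleaning'] = [[w for w in row if w in keep] for row in df['before_cleaning']]
print(df)
```

Rows that hold an empty list become NaN after explode, and value_counts ignores NaN by default, so they should not affect the counts.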
Answered By: mozway