Remove words inside a tuple that appear only once across the dataset
Question:
I have a dataset with more than 1000 rows, each containing a tuple of words. I want to remove the words inside each tuple that appear only once across all rows. Here is an example of the data…
   before_cleaning        after_cleaning
0  [cool]                 [cool]
1  [gooooood]             []
2  [we, love, it, cool]   [love, it, cool]
3  [love, it]             [love, it]
Column before_cleaning is the initial data, and column after_cleaning is what I expect the data to look like after the removal. As you can see in the example, "gooooood" and "we" are removed because each of them appears only once across rows 0 to 3.
Answers:
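You can use apply with a lambda function: inside it, loop over each row's list and keep only the words whose count across all rows is greater than 1. Note that this rebuilds the flattened word list for every word, so it gets slow on large data.
Code:
df['after_cleaning'] = df['before_cleaning'].apply(lambda row: [i for i in row if sum(list(df['before_cleaning']), []).count(i) > 1])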
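To make the `sum(..., [])` trick in that one-liner easier to follow, here is a minimal sketch on plain lists. The name `rows` is illustrative and stands in for the list of rows taken from the DataFrame column. The sketch also flattens only once and reuses the result, which is an adjustment of mine rather than part of the answer above.

```python
# `rows` is a stand-in for the rows of the DataFrame column (illustrative name).
rows = [["cool"], ["gooooood"], ["we", "love", "it", "cool"], ["love", "it"]]

# sum(rows, []) starts from an empty list and concatenates every row onto it,
# which flattens the list of lists into one list of words.
flat = sum(rows, [])

# Flatten once, then reuse `flat` for every word instead of rebuilding it.
after = [[w for w in row if flat.count(w) > 1] for row in rows]

print(flat)   # ['cool', 'gooooood', 'we', 'love', 'it', 'cool', 'love', 'it']
print(after)  # [['cool'], [], ['love', 'it', 'cool'], ['love', 'it']]
```

Each `flat.count(w)` still scans the whole flattened list, so this stays slow when there are many words.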
Use collections.Counter and itertools.chain, a set, and a list comprehension:
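from collections import Counter
from itertools import chain

# Count every word across all rows and keep those seen more than once
keep = {k for k, v in Counter(chain.from_iterable(df['before_cleaning'])).items()
        if v > 1}
# {'cool', 'it', 'love'}

# Rebuild each row with only the words in the keep set
df['after_cleaning'] = [[x for x in row if x in keep]
                        for row in df['before_cleaning']]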
Output:
   before_cleaning        after_cleaning
0  [cool]                 [cool]
1  [good]                 []
2  [we, love, it, cool]   [love, it, cool]
3  [love, it]             [love, it]
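Since the question mentions tuples, the same Counter idea could also be wrapped in a small helper that keeps each row's container type. This is only a sketch: the function name `drop_singletons` and the sample `rows` are illustrative and not part of the answer above.

```python
from collections import Counter
from itertools import chain


def drop_singletons(rows):
    """Return the rows with every word that occurs only once overall removed."""
    counts = Counter(chain.from_iterable(rows))
    keep = {word for word, n in counts.items() if n > 1}
    # type(row)(...) rebuilds each row as the same container type (list or tuple).
    return [type(row)(w for w in row if w in keep) for row in rows]


rows = [("cool",), ("gooooood",), ("we", "love", "it", "cool"), ("love", "it")]
cleaned = drop_singletons(rows)
print(cleaned)  # [('cool',), (), ('love', 'it', 'cool'), ('love', 'it')]
```

The set lookup `w in keep` takes constant time on average, so the whole pass is roughly linear in the total number of words.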
Pandas alternative to create the set:
keep = set(df['before_cleaning'].explode().value_counts().loc[lambda x: x>1].index)
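For a self-contained check of the pandas variant, the sketch below rebuilds the sample frame from the question and applies the same steps end to end. The intermediate name `counts` is introduced here for readability, and the column contents are assumed to be lists of strings.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "before_cleaning": [
        ["cool"],
        ["gooooood"],
        ["we", "love", "it", "cool"],
        ["love", "it"],
    ]
})

# explode() yields one row per word; value_counts() tallies each word
# across the whole column.
counts = df["before_cleaning"].explode().value_counts()
keep = set(counts[counts > 1].index)

# Filter each original list against the keep set.
df["after_cleaning"] = [[w for w in row if w in keep]
                        for row in df["before_cleaning"]]
print(df)
```

Here `keep` contains 'cool', 'it' and 'love', matching the set built with Counter.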