Finding common words in a column based on values from another column

Question:

I have a dataframe with a column named source that labels each row as coming from one of two word lists:

  source   words  letter_count
1  list1   apple             5
2  list1    pear             4
3  list1  banana             6
4  list2    ford             4
5  list2   chevy             5
6  list2   apple             5
7  list2  banana             6

I’m trying to return a new dataframe that shows the words that appear in both list1 and list2:

   words   letter_count
1  apple        5
2  banana       6

I’m using Python and pandas.

Asked By: peoplet


Answers:

I think you’re looking for pandas.Series.duplicated(). It returns a mask (a boolean series with one True/False value per row). By default (keep='first'), the first occurrence of each value is False and every later occurrence is True, so a value that occurs only once is always False. You can then index the dataframe with that mask:

new_df = df[df['words'].duplicated()].drop('source', axis=1)

Output:

>>> new_df
    words  letter_count
6   apple             5
7  banana             6
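
To make the mask concrete, here is a minimal sketch that rebuilds the sample data from the question (how df is constructed is an assumption; the question only shows the printed frame) and compares the default keep='first' with keep=False. Note that duplicated() never looks at source, so a word repeated inside a single list would be flagged too.

```python
import pandas as pd

# Sample data reconstructed from the question; the 1-based index mirrors the printout.
df = pd.DataFrame(
    {
        "source": ["list1", "list1", "list1", "list2", "list2", "list2", "list2"],
        "words": ["apple", "pear", "banana", "ford", "chevy", "apple", "banana"],
        "letter_count": [5, 4, 6, 4, 5, 5, 6],
    },
    index=range(1, 8),
)

# Default keep="first": only the second and later occurrences are True.
mask_later = df["words"].duplicated()

# keep=False: every occurrence of a repeated word is True, including the first.
mask_all = df["words"].duplicated(keep=False)

# Rows selected by the default mask, without the source column.
repeated_words = df.loc[mask_later, ["words", "letter_count"]]
print(repeated_words)
```

With this data the default mask selects rows 6 and 7, which matches the desired result only because no word is repeated within the same list.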
Answered By: user17242583

Here is a way to find if the same word exists in both lists in the source column.

common = set.intersection(*df.groupby('source')['words'].agg(set))
df.loc[df['words'].isin(common), ['words', 'letter_count']].drop_duplicates('words', keep='last')

or:

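cols = ['words', 'letter_count']
# True for rows whose (words, letter_count) pair already appeared in an earlier row
m1 = df.duplicated(cols)
# True for rows whose word occurs in every distinct source
m2 = df.groupby('words')['source'].transform('nunique').eq(df['source'].nunique())

df.loc[m1 & m2, cols]

Output: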

    words  letter_count
6   apple             5
7  banana             6
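
As a rough step-by-step sketch of the set-intersection idea, the snippet below rebuilds the question's sample data (the df construction is an assumption) and names each intermediate value; words_by_source, common, and result are illustrative names, not part of the answer above.

```python
import pandas as pd

# Sample data reconstructed from the question; the 1-based index mirrors the printout.
df = pd.DataFrame(
    {
        "source": ["list1", "list1", "list1", "list2", "list2", "list2", "list2"],
        "words": ["apple", "pear", "banana", "ford", "chevy", "apple", "banana"],
        "letter_count": [5, 4, 6, 4, 5, 5, 6],
    },
    index=range(1, 8),
)

# One set of words per source value (a Series indexed by "list1", "list2").
words_by_source = df.groupby("source")["words"].agg(set)

# Words that are present in every source.
common = set.intersection(*words_by_source)

# Keep one row per common word and drop the source column.
result = (
    df.loc[df["words"].isin(common), ["words", "letter_count"]]
    .drop_duplicates("words")
    .reset_index(drop=True)
)
print(result)
```

Unlike a plain duplicated() check, this only keeps words found in every list, so a word repeated within list1 alone would not be returned. reset_index(drop=True) renumbers the rows from 0; leave it out to keep the original row labels.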
Answered By: rhug123