Pandas dataframe select rows where a list-column contains any of a list of strings


I’ve got a pandas DataFrame that looks like this:

  molecule            species
0        a              [dog]
1        b       [horse, pig]
2        c         [cat, dog]
3        d  [cat, horse, pig]
4        e     [chicken, pig]

and I'd like to extract a DataFrame containing only those rows whose species list contains any of selection = ['cat', 'dog']. So the result should look like this:

  molecule            species
0        a              [dog]
1        c         [cat, dog]
2        d  [cat, horse, pig]

What would be the simplest way to do this?

For testing:

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a', 'b', 'c', 'd', 'e'], 'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'], ['cat', 'horse', 'pig'], ['chicken', 'pig']]})
Asked By: NicoH



You can build a boolean mask with apply here.

selection = ['cat', 'dog']

# True for every row whose list contains at least one selected item
mask = df.species.apply(lambda x: any(item in x for item in selection))
df1 = df[mask]

For the DataFrame you’ve provided as an example above, df1 will be:

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
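
A self-contained sketch of the mask approach, using the test data from the question. The names mask and df1 follow the answer; the reset_index step is an addition that is not part of the answer and is only needed if the result should be renumbered 0, 1, 2 as in the question's expected output.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# True for each row whose list shares at least one item with selection
mask = df['species'].apply(lambda x: any(item in x for item in selection))

# Boolean indexing keeps the original row labels (0, 2, 3);
# reset_index(drop=True) renumbers them from 0 (assumed to be wanted here)
df1 = df[mask].reset_index(drop=True)
print(df1)
```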
Answered By: Wes Doyle

Using NumPy would be much faster than using pandas in this case.

Option 1: Using NumPy's intersect1d:

mask = df.species.apply(lambda x: np.intersect1d(x, selection).size > 0)
450 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]

Option 2: A similar solution to the one above, using NumPy's in1d (deprecated since NumPy 2.0 in favor of np.isin):

df[df.species.apply(lambda x: np.any(np.in1d(x, selection)))]
420 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Option 3: Interestingly, a pure Python set intersection is quite fast here:

df[df.species.apply(lambda x: bool(set(x) & set(selection)))]
305 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
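
A small sketch that runs the three options side by side on the question's test data and checks that they produce the same mask. It assumes np.isin as a stand-in for np.in1d in Option 2; the names mask_intersect, mask_isin and mask_set are made up for this illustration. It does not reproduce the timings above, which depend on the machine and library versions.

```python
import numpy as np
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

species = df['species']

# Option 1: non-empty intersection between the row's list and selection
mask_intersect = species.apply(lambda x: np.intersect1d(x, selection).size > 0)

# Option 2: element-wise membership test, then any()
# (np.isin is used here instead of np.in1d; this swap is an assumption)
mask_isin = species.apply(lambda x: np.isin(x, selection).any())

# Option 3: pure Python set intersection
mask_set = species.apply(lambda x: bool(set(x) & set(selection)))

# All three masks should select the same rows
assert mask_intersect.tolist() == mask_isin.tolist() == mask_set.tolist()
print(df[mask_set])
```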
Answered By: Vaishali

This is an easy and basic approach.
You can create a function that checks whether any element of the selection list is present in a row's species list.

def check(speciesList):
    # Return True as soon as any selected animal is found in the row's list
    for animal in selection:
        if animal in speciesList:
            return True
    return False

You can then use this function to create a column that contains True or False, depending on whether the record contains at least one element of the selection list, and build a new DataFrame based on it.

df['containsCatDog'] = df.species.apply(check)
newDf = df[df.containsCatDog]

I hope it helps.

Answered By: Command

IIUC, re-creating your df from the list column and then using isin with any should be faster than apply:
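
A minimal sketch of the approach described, assuming that "re-create your df" means expanding the species lists into a temporary DataFrame with one list item per column. The names expanded, mask and result are hypothetical and chosen only for this illustration.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# One column per list position; shorter lists are padded with missing values
expanded = pd.DataFrame(df['species'].tolist(), index=df.index)

# isin tests every cell, any(axis=1) collapses each row to a single boolean
mask = expanded.isin(selection).any(axis=1)

result = df[mask]
print(result)
```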

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
Answered By: BENY

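Using pandas str.contains (uses a regular expression). The list column has to be converted to strings first, and note that this matches substrings, so 'cat' would also match 'caterpillar':

df[df["species"].astype(str).str.contains('cat|dog', regex=True)]

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]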
Answered By: Ken Dekalb

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a', 'b', 'c', 'd', 'e'], 'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'], ['cat', 'horse', 'pig'], ['chicken', 'pig']]})

# Select the rows for each animal separately
df1 = df[df['species'].apply(lambda x: 'dog' in x)]
df2 = df[df['species'].apply(lambda x: 'cat' in x)]

# Combine both results and drop rows that matched more than once
frames = [df1, df2]
result = pd.concat(frames, join='inner', ignore_index=False)
result = result[~result.index.duplicated(keep='first')]
Answered By: ALEN M A