Python & Pandas: How to query if a list-type column contains something?

Question:

I have a dataframe that contains info about movies. It has a column called genre, which holds a list of the genres each movie belongs to. For example:

df['genre']

## returns 

0       ['comedy', 'sci-fi']
1       ['action', 'romance', 'comedy']
2       ['documentary']
3       ['crime','horror']
...

I want to know how I can query the dataframe so that it returns the movies that belong to a certain genre.

For example, something like df['genre'].contains('comedy') that returns 0 or 1 for each row.

I know for a list, I can do things like:

'comedy' in ['comedy', 'sci-fi']

However, I didn’t find anything similar in pandas. The only thing I know is df['genre'].str.contains(), but it didn’t work for the list type.

Asked By: cqcn1991

||

Answers:

You can use apply to create a mask and then use boolean indexing:

mask = df.genre.apply(lambda x: 'comedy' in x)
df1 = df[mask]
print(df1)
                       genre
0           [comedy, sci-fi]
1  [action, romance, comedy]
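
For readers who want to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption rebuilt from the four rows shown in the question; the real data presumably has more rows and columns.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question (an assumption;
# the asker's real dataframe is not available).
df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["action", "romance", "comedy"],
        ["documentary"],
        ["crime", "horror"],
    ]
})

# Boolean mask: True where the row's list contains 'comedy'.
mask = df["genre"].apply(lambda genres: "comedy" in genres)

# Boolean indexing keeps only the rows where the mask is True.
df1 = df[mask]
print(df1)
```

The mask is an ordinary boolean Series, so it can be combined with other conditions using & and | before indexing.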
Answered By: jezrael

using sets

df.genre.map(set(['comedy']).issubset)

0     True
1     True
2    False
3    False
dtype: bool

df.genre[df.genre.map(set(['comedy']).issubset)]

0             [comedy, sci-fi]
1    [action, romance, comedy]
dtype: object

presented in a way I like better

comedy = set(['comedy'])
iscomedy = comedy.issubset
df[df.genre.map(iscomedy)]

more efficient

comedy = set(['comedy'])
iscomedy = comedy.issubset
df[[iscomedy(l) for l in df.genre.values.tolist()]]
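
The set approach also extends to more than one genre. The sketch below is an illustration, not part of the original answer: the sample frame is rebuilt from the question, and the name wanted is hypothetical. issubset requires every wanted genre to be present, while a negated isdisjoint asks for at least one.

```python
import pandas as pd

# Sample frame rebuilt from the question (an assumption).
df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["action", "romance", "comedy"],
        ["documentary"],
        ["crime", "horror"],
    ]
})

# Hypothetical set of genres to look for.
wanted = {"comedy", "romance"}

# True where the row's list contains ALL of the wanted genres.
has_all = df["genre"].map(wanted.issubset)

# isdisjoint is True when nothing overlaps, so negate it
# to get rows containing AT LEAST ONE wanted genre.
has_any = ~df["genre"].map(wanted.isdisjoint)

print(df[has_all])
print(df[has_any])
```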

using str in two passes
slow, and not perfectly accurate: str.contains matches substrings, so 'comedy' also matches any genre whose name merely contains it

df[df.genre.str.join(' ').str.contains('comedy')]
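
To see why the join-then-contains version can be inaccurate, here is a small sketch. The rows are hypothetical and chosen to expose the problem ('tragicomedy' does not appear in the question's data): joining the lists and searching the resulting string matches substrings, whereas a membership test only matches whole list items.

```python
import pandas as pd

# Hypothetical rows; 'tragicomedy' is made up to show a false positive.
df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["tragicomedy"],
        ["documentary"],
    ]
})

# Two passes: join each list into one string, then search that string.
joined = df["genre"].str.join(" ")
substring_mask = joined.str.contains("comedy")

# Exact membership test on the original lists, for comparison.
exact_mask = df["genre"].apply(lambda genres: "comedy" in genres)

print(pd.DataFrame({"substring": substring_mask, "exact": exact_mask}))
```

Row 1 is flagged by the substring search but not by the membership test, which is the inaccuracy mentioned above.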
Answered By: piRSquared

According to the source code, you can use .str.contains(..., regex=False).

Answered By: HYRY

A complete example:

import pandas as pd

data = pd.DataFrame([[['foo', 'bar']],
                    [['bar', 'baz']]], columns=['list_column'])
print(data)
  list_column
0  [foo, bar]
1  [bar, baz]

filtered_data = data.loc[
    lambda df: df.list_column.apply(
        lambda l: 'foo' in l
    )
]
print(filtered_data)
  list_column
0  [foo, bar]
Answered By: Adrien Renaud

You need to set regex=False and .str.contains will work for list values as you would expect:

In : df['genre'].str.contains('comedy', regex=False)
Out:
0     True
1     True
2    False
3    False
Name: genre, dtype: bool
Answered By: joctee

This can be done in all three of the suggested ways: str.contains, set, or apply with in. In the timings below set was the fastest, although its margin over apply is within the reported standard deviation.

Here’s a performance comparison of the three methods on an extrapolated dataframe with 10,000 rows:

set

%%timeit -n 500 -r 35
df[df.genre.map(set(['comedy']).issubset)]
2.23 ms ± 154 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)

apply & in

%%timeit -n 500 -r 35
df[df.genre.apply(lambda x: 'comedy' in x)]
2.36 ms ± 359 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)

str.contains

%%timeit -n 500 -r 35
df[df['genre'].str.contains('comedy', regex=False)]
2.83 ms ± 299 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
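
The %%timeit cells above need IPython or Jupyter. A rough way to repeat the comparison in a plain script is sketched below, with loud assumptions: the answer does not say how the 10,000-row frame was built, so here the four sample rows are simply repeated 2,500 times, and the function names are hypothetical. Only the set, apply, and list-comprehension variants are timed; absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumption: "extrapolated" is taken to mean repeating the question's
# four sample rows until there are 10,000 rows.
base_rows = [
    ["comedy", "sci-fi"],
    ["action", "romance", "comedy"],
    ["documentary"],
    ["crime", "horror"],
]
df = pd.DataFrame({"genre": base_rows * 2500})

comedy = {"comedy"}


def with_set():
    # map a bound set method over the column
    return df[df["genre"].map(comedy.issubset)]


def with_apply():
    # apply a lambda doing a plain membership test
    return df[df["genre"].apply(lambda x: "comedy" in x)]


def with_listcomp():
    # build the boolean mask as a plain Python list
    return df[[comedy.issubset(g) for g in df["genre"].tolist()]]


runs = 20
for fn in (with_set, with_apply, with_listcomp):
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```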
Answered By: LucyDrops

This can be done using the isin method, which returns a boolean mask marking the rows where each item is located; the mask can then be used to filter the dataframe.

df1[df1.name.isin(['Rohit','Rahul'])]

Here df1 is a DataFrame object and name is a Series of strings.

>>> df1[df1.name.isin(['Rohit','Rahul'])]
   sample1   name  Marks  Class 
0        1  Rohit     34     10
1        2  Rahul     56     12
>>> type(df1)
<class 'pandas.core.frame.DataFrame'>
>>> df1.head()
   sample1    name  Marks  Class
0        1   Rohit     34     10
1        2   Rahul     56     12
2        3   ankit     78     11
3        4   sajan     98     10
4        5  chintu     76      9
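
A runnable reconstruction of this session is sketched below, using the five rows shown by df1.head() (whether the original frame had more rows is unknown). Note that isin checks whether each cell's value equals one of the listed values, so it fits a column of plain strings like name; it does not look inside list-valued cells such as the genre column in the question.

```python
import pandas as pd

# Rebuilt from the df1.head() output above; any further rows are unknown.
df1 = pd.DataFrame({
    "sample1": [1, 2, 3, 4, 5],
    "name": ["Rohit", "Rahul", "ankit", "sajan", "chintu"],
    "Marks": [34, 56, 78, 98, 76],
    "Class": [10, 12, 11, 10, 9],
})

# Boolean Series: True where the name is one of the listed values.
mask = df1["name"].isin(["Rohit", "Rahul"])

# Filter the frame with the mask.
print(df1[mask])
```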
Answered By: Rohit Panwar