Python & Pandas: How to query if a list-type column contains something?
Question:
I have a dataframe that contains information about movies. It has a column called genre, which contains a list of the genres each movie belongs to. For example:
df['genre']
## returns
0 ['comedy', 'sci-fi']
1 ['action', 'romance', 'comedy']
2 ['documentary']
3 ['crime','horror']
...
How can I query the dataframe so that it returns the movies that belong to a certain genre?
For example, something like df['genre'].contains('comedy') that returns 0 or 1 for each row.
I know for a list, I can do things like:
'comedy' in ['comedy', 'sci-fi']
However, I didn’t find anything similar in pandas. The only thing I know is df['genre'].str.contains(), but it didn’t work for the list type.
Answers:
You can use apply to create a mask and then use boolean indexing:
mask = df.genre.apply(lambda x: 'comedy' in x)
df1 = df[mask]
print(df1)
genre
0 [comedy, sci-fi]
1 [action, romance, comedy]
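For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample column from the question and applies the same mask. The helper name has_genre is made up for illustration, and astype(int) is only there to give the 0/1 values the question asked for.

```python
import pandas as pd

# Sample data mirroring the genre column shown in the question.
df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["action", "romance", "comedy"],
        ["documentary"],
        ["crime", "horror"],
    ]
})


def has_genre(frame, genre):
    """Hypothetical helper: boolean mask of rows whose list contains `genre`."""
    return frame["genre"].apply(lambda genres: genre in genres)


mask = has_genre(df, "comedy")
comedies = df[mask]                 # boolean indexing keeps the matching rows

print(mask.astype(int).tolist())    # 0/1 flags per row
print(comedies.index.tolist())      # row labels of the comedies
```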
using sets
df.genre.map(set(['comedy']).issubset)
0 True
1 True
2 False
3 False
dtype: bool
df.genre[df.genre.map(set(['comedy']).issubset)]
0 [comedy, sci-fi]
1 [action, romance, comedy]
dtype: object
presented in a way I like better
comedy = set(['comedy'])
iscomedy = comedy.issubset
df[df.genre.map(iscomedy)]
more efficient
comedy = set(['comedy'])
iscomedy = comedy.issubset
df[[iscomedy(l) for l in df.genre.values.tolist()]]
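One thing the set approach makes easy is asking about several genres at once. The sketch below is an extension of the idea rather than part of the answer above: issubset gives "contains all of these", and negating isdisjoint gives "contains at least one of these". The variable names wanted, has_all and has_any are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["action", "romance", "comedy"],
        ["documentary"],
        ["crime", "horror"],
    ]
})

wanted = {"comedy", "action"}

# Both set methods accept any iterable, so each list can be passed directly.
has_all = df["genre"].map(wanted.issubset)       # every wanted genre is present
has_any = ~df["genre"].map(wanted.isdisjoint)    # at least one wanted genre is present

print(df[has_all])
print(df[has_any])
```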
using str, in two passes (slow, and not perfectly accurate!)
df[df.genre.str.join(' ').str.contains('comedy')]
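To see why the join-based version is "not perfectly accurate", here is a small sketch comparing it with an exact membership test. The genre 'romantic-comedy' is invented purely to trigger a substring match; it is not from the original data.

```python
import pandas as pd

# 'romantic-comedy' is a made-up genre used to provoke a false positive.
df = pd.DataFrame({
    "genre": [
        ["comedy", "sci-fi"],
        ["romantic-comedy"],
        ["documentary"],
    ]
})

# Joining the lists turns the test into a substring search on one string.
joined = df["genre"].str.join(" ").str.contains("comedy")

# Exact membership only matches a list element equal to 'comedy'.
exact = df["genre"].apply(lambda genres: "comedy" in genres)

print(pd.DataFrame({"joined": joined, "exact": exact}))
```

Row 1 is flagged by the joined version but not by the exact test, because 'comedy' is a substring of 'romantic-comedy'.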
According to the source code, you can use .str.contains(..., regex=False).
A complete example:
import pandas as pd
data = pd.DataFrame([[['foo', 'bar']],
                     [['bar', 'baz']]], columns=['list_column'])
print(data)
list_column
0 [foo, bar]
1 [bar, baz]
filtered_data = data.loc[
    lambda df: df.list_column.apply(
        lambda l: 'foo' in l
    )
]
print(filtered_data)
list_column
0 [foo, bar]
You need to set regex=False, and .str.contains will work for list values as you would expect:
In : df['genre'].str.contains('comedy', regex=False)
Out:
0 True
1 True
2 False
3 False
Name: genre, dtype: bool
This can be done in all three of the suggested ways: using str.contains, set, or apply with in. Using set was the fastest of the three in the comparison below, though only by a small margin.
Here’s a performance comparison of the three methods on an extrapolated dataframe with 10,000 rows:
set
%%timeit -n 500 -r 35
df[df.genre.map(set(['comedy']).issubset)]
2.23 ms ± 154 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
apply & in
%%timeit -n 500 -r 35
df[df.genre.apply(lambda x: 'comedy' in x)]
2.36 ms ± 359 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
str.contains
%%timeit -n 500 -r 35
df[df['genre'].str.contains('comedy', regex=False)]
2.83 ms ± 299 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
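If you want to repeat the comparison outside a notebook, something like the following sketch should do, using the standard timeit module instead of the %%timeit magic. It covers the set and apply variants plus the list comprehension from the earlier answer. How the 10,000-row frame was built is not stated, so repeating the four sample rows 2,500 times is my assumption, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = [
    ["comedy", "sci-fi"],
    ["action", "romance", "comedy"],
    ["documentary"],
    ["crime", "horror"],
]
# Assumption: the 10,000 rows are the four sample rows repeated 2,500 times.
df = pd.DataFrame({"genre": base * 2500})

iscomedy = {"comedy"}.issubset

candidates = {
    "set + map": lambda: df[df.genre.map(iscomedy)],
    "apply + in": lambda: df[df.genre.apply(lambda x: "comedy" in x)],
    "list comprehension": lambda: df[[iscomedy(l) for l in df.genre.tolist()]],
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats of 20 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    timings[name] = best * 1e3
    print(f"{name}: {timings[name]:.2f} ms per call")
```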
This can be done using the isin method, which returns a boolean mask marking the rows whose value matches one of the given items. Indexing the dataframe with that mask returns the matching rows.
df1[df1.name.isin(['Rohit','Rahul'])]
Here df1 is a DataFrame object and name is a string Series.
>>> df1[df1.name.isin(['Rohit','Rahul'])]
sample1 name Marks Class
0 1 Rohit 34 10
1 2 Rahul 56 12
>>> type(df1)
<class 'pandas.core.frame.DataFrame'>
>>> df1.head()
sample1 name Marks Class
0 1 Rohit 34 10
1 2 Rahul 56 12
2 3 ankit 78 11
3 4 sajan 98 10
4 5 chintu 76 9
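Note that isin compares each cell as a whole, so it fits a column of plain strings like name rather than a column of lists like genre. The sketch below rebuilds df1 from the output shown above and then, as a possible bridge to the original question, flattens a list column with explode before calling isin. The explode/groupby step is my own suggestion, not part of this answer.

```python
import pandas as pd

# df1 rebuilt from the head() output shown above.
df1 = pd.DataFrame({
    "sample1": [1, 2, 3, 4, 5],
    "name": ["Rohit", "Rahul", "ankit", "sajan", "chintu"],
    "Marks": [34, 56, 78, 98, 76],
    "Class": [10, 12, 11, 10, 9],
})

# isin on a column of scalar strings: one boolean per row.
selected = df1[df1["name"].isin(["Rohit", "Rahul"])]
print(selected)

# For a list-valued column, one option is to explode it to one genre per row,
# test with isin, then collapse back to one boolean per original row.
genres = pd.Series([
    ["comedy", "sci-fi"],
    ["action", "romance", "comedy"],
    ["documentary"],
    ["crime", "horror"],
])
in_list = genres.explode().isin(["comedy", "horror"]).groupby(level=0).any()
print(in_list)
```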