Use a list of values to select rows from a Pandas dataframe
Question:
Let’s say I have the following Pandas dataframe:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
I can subset based on a specific value:
x = df[df['A'] == 3]
x
A B
2 3 3
But how can I subset based on a list of values? – something like this:
list_of_values = [3,6]
y = df[df['A'] in list_of_values]
To get:
A B
1 6 2
2 3 3
Answers:
You can use the isin
method:
In [1]: df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
In [2]: df
Out[2]:
A B
0 5 1
1 6 2
2 3 3
3 4 5
In [3]: df[df['A'].isin([3, 6])]
Out[3]:
A B
1 6 2
2 3 3
And to get the opposite use ~
:
In [4]: df[~df['A'].isin([3, 6])]
Out[4]:
A B
0 5 1
3 4 5
You can use the method query:
df.query('A in [6, 3]')
# df.query('A == [6, 3]')
or
lst = [6, 3]
df.query('A in @lst')
# df.query('A == @lst')
Another method;
df.loc[df.apply(lambda x: x.A in [3,6], axis=1)]
Unlike the isin method, this is particularly useful in determining if the list contains a function of the column A
. For example, f(A) = 2*A - 5
as the function;
df.loc[df.apply(lambda x: 2*x.A-5 in [3,6], axis=1)]
It should be noted that this approach is slower than the isin
method.
You can store your values in a list as:
lis = [3,6]
then
df1 = df[df['A'].isin(lis)]
list_of_values
doesn’t have to be a list
; it can be set
, tuple
, dictionary
, numpy array, pandas Series, generator, range
etc. and isin()
and query()
will still work.
Some common problems with selecting rows
1. list_of_values
is a range
If you need to filter within a range, you can use between()
method or query()
.
list_of_values = [3, 4, 5, 6] # a range of values
df[df['A'].between(3, 6)] # or
df.query('3<=A<=6')
2. Return df
in the order of list_of_values
In the OP, the values in list_of_values
don’t appear in that order in df
. If you want df
to return in the order they appear in list_of_values
, i.e. "sort" by list_of_values
, use loc
.
list_of_values = [3, 6]
df.set_index('A').loc[list_of_values].reset_index()
If you want to retain the old index, you can use the following.
list_of_values = [3, 6, 3]
df.reset_index().set_index('A').loc[list_of_values].reset_index().set_index('index').rename_axis(None)
3. Don’t use apply
In general, isin()
and query()
are the best methods for this task; there’s no need for apply()
. For example, for function f(A) = 2*A - 5
on column A
, both isin()
and query()
work much more efficiently:
df[(2*df['A']-5).isin(list_of_values)] # or
df[df['A'].mul(2).sub(5).isin(list_of_values)] # or
df.query("A.mul(2).sub(5) in @list_of_values")
4. Select rows not in list_of_values
To select rows not in list_of_values
, negate isin()
/in
:
df[~df['A'].isin(list_of_values)]
df.query("A not in @list_of_values") # df.query("A != @list_of_values")
5. Select rows where multiple columns are in list_of_values
If you want to filter using both (or multiple) columns, there’s any()
and all()
to reduce columns (axis=1
) depending on the need.
- Select rows where at least one of
A
or B
is in list_of_values
:
df[df[['A','B']].isin(list_of_values).any(1)]
df.query("A in @list_of_values or B in @list_of_values")
- Select rows where both of
A
and B
are in list_of_values
:
df[df[['A','B']].isin(list_of_values).all(1)]
df.query("A in @list_of_values and B in @list_of_values")
Bonus:
You can also call isin()
inside query()
:
df.query("A.isin(@list_of_values).values")
Its trickier with f-Strings
list_of_values = [3,6]
df.query(f'A in {list_of_values}')
The above answers are correct, but if you still are not able to filter out rows as expected, make sure both DataFrames’ columns have the same dtype
.
source = source.astype({1: 'int64'})
to_rem = to_rem.astype({'some col': 'int64'})
works = source[~source[1].isin(to_rem['some col'])]
Took me long enough.
A non pandas solution that compares in terms of speed may be:
filtered_column = set(df.A) - set(list)
Let’s say I have the following Pandas dataframe:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
I can subset based on a specific value:
x = df[df['A'] == 3]
x
A B
2 3 3
But how can I subset based on a list of values? – something like this:
list_of_values = [3,6]
y = df[df['A'] in list_of_values]
To get:
A B
1 6 2
2 3 3
You can use the isin
method:
In [1]: df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
In [2]: df
Out[2]:
A B
0 5 1
1 6 2
2 3 3
3 4 5
In [3]: df[df['A'].isin([3, 6])]
Out[3]:
A B
1 6 2
2 3 3
And to get the opposite use ~
:
In [4]: df[~df['A'].isin([3, 6])]
Out[4]:
A B
0 5 1
3 4 5
You can use the method query:
df.query('A in [6, 3]')
# df.query('A == [6, 3]')
or
lst = [6, 3]
df.query('A in @lst')
# df.query('A == @lst')
Another method;
df.loc[df.apply(lambda x: x.A in [3,6], axis=1)]
Unlike the isin method, this is particularly useful in determining if the list contains a function of the column A
. For example, f(A) = 2*A - 5
as the function;
df.loc[df.apply(lambda x: 2*x.A-5 in [3,6], axis=1)]
It should be noted that this approach is slower than the isin
method.
You can store your values in a list as:
lis = [3,6]
then
df1 = df[df['A'].isin(lis)]
list_of_values
doesn’t have to be a list
; it can be set
, tuple
, dictionary
, numpy array, pandas Series, generator, range
etc. and isin()
and query()
will still work.
Some common problems with selecting rows
1. list_of_values
is a range
If you need to filter within a range, you can use between()
method or query()
.
list_of_values = [3, 4, 5, 6] # a range of values
df[df['A'].between(3, 6)] # or
df.query('3<=A<=6')
2. Return df
in the order of list_of_values
In the OP, the values in list_of_values
don’t appear in that order in df
. If you want df
to return in the order they appear in list_of_values
, i.e. "sort" by list_of_values
, use loc
.
list_of_values = [3, 6]
df.set_index('A').loc[list_of_values].reset_index()
If you want to retain the old index, you can use the following.
list_of_values = [3, 6, 3]
df.reset_index().set_index('A').loc[list_of_values].reset_index().set_index('index').rename_axis(None)
3. Don’t use apply
In general, isin()
and query()
are the best methods for this task; there’s no need for apply()
. For example, for function f(A) = 2*A - 5
on column A
, both isin()
and query()
work much more efficiently:
df[(2*df['A']-5).isin(list_of_values)] # or
df[df['A'].mul(2).sub(5).isin(list_of_values)] # or
df.query("A.mul(2).sub(5) in @list_of_values")
4. Select rows not in list_of_values
To select rows not in list_of_values
, negate isin()
/in
:
df[~df['A'].isin(list_of_values)]
df.query("A not in @list_of_values") # df.query("A != @list_of_values")
5. Select rows where multiple columns are in list_of_values
If you want to filter using both (or multiple) columns, there’s any()
and all()
to reduce columns (axis=1
) depending on the need.
- Select rows where at least one of
A
orB
is inlist_of_values
:df[df[['A','B']].isin(list_of_values).any(1)] df.query("A in @list_of_values or B in @list_of_values")
- Select rows where both of
A
andB
are inlist_of_values
:df[df[['A','B']].isin(list_of_values).all(1)] df.query("A in @list_of_values and B in @list_of_values")
Bonus:
You can also call isin()
inside query()
:
df.query("A.isin(@list_of_values).values")
Its trickier with f-Strings
list_of_values = [3,6]
df.query(f'A in {list_of_values}')
The above answers are correct, but if you still are not able to filter out rows as expected, make sure both DataFrames’ columns have the same dtype
.
source = source.astype({1: 'int64'})
to_rem = to_rem.astype({'some col': 'int64'})
works = source[~source[1].isin(to_rem['some col'])]
Took me long enough.
A non pandas solution that compares in terms of speed may be:
filtered_column = set(df.A) - set(list)