How to implement a Boolean search with multiple columns in pandas
Question:
I have a pandas df and would like to accomplish something along these lines (in SQL terms):
SELECT * FROM df WHERE column1 = 'a' OR column2 = 'b' OR column3 = 'c' etc.
Now this works, for one column/value pair:
foo = df.loc[df['column']==value]
However, I’m not sure how to expand that to multiple column/value pairs.
- To be clear, each column matches a different value.
Answers:
You need to enclose multiple conditions in braces due to operator precedence and use the bitwise and (&
) and or (|
) operators:
foo = df[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]
If you use and
or or
, then pandas is likely to moan that the comparison is ambiguous. In that case, it is unclear whether we are comparing every value in a series in the condition, and what does it mean if only 1 or all but 1 match the condition. That is why you should use the bitwise operators or the numpy np.all
or np.any
to specify the matching criteria.
There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html
but there are some limitations mainly to do with issues where there could be ambiguity between column names and index values.
A more concise–but not necessarily faster–method is to use DataFrame.isin()
and DataFrame.any()
In [27]: n = 10
In [28]: df = DataFrame(randint(4, size=(n, 2)), columns=list('ab'))
In [29]: df
Out[29]:
a b
0 0 0
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
7 3 0
8 1 1
9 2 2
[10 rows x 2 columns]
In [30]: df.isin([1, 2])
Out[30]:
a b
0 False False
1 True True
2 True True
3 True False
4 True False
5 False True
6 True True
7 False False
8 True True
9 True True
[10 rows x 2 columns]
In [31]: df.isin([1, 2]).any(1)
Out[31]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
In [32]: df.loc[df.isin([1, 2]).any(1)]
Out[32]:
a b
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
8 1 1
9 2 2
[8 rows x 2 columns]
Easiest way to do this
if this helpful hit up arrow! Tahnks!!
students = [ ('jack1', 'Apples1' , 341) ,
('Riti1', 'Mangos1' , 311) ,
('Aadi1', 'Grapes1' , 301) ,
('Sonia1', 'Apples1', 321) ,
('Lucy1', 'Mangos1' , 331) ,
('Mike1', 'Apples1' , 351),
('Mik', 'Apples1' , np.nan)
]
#Create a DataFrame object
df = pd.DataFrame(students, columns = ['Name1' , 'Product1', 'Sale1'])
print(df)
Name1 Product1 Sale1
0 jack1 Apples1 341
1 Riti1 Mangos1 311
2 Aadi1 Grapes1 301
3 Sonia1 Apples1 321
4 Lucy1 Mangos1 331
5 Mike1 Apples1 351
6 Mik Apples1 NaN
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’,
subset = df[df['Product1'] == 'Apples1']
print(subset)
Name1 Product1 Sale1
0 jack1 Apples1 341
3 Sonia1 Apples1 321
5 Mike1 Apples1 351
6 Mik Apples1 NA
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND notnull value in Sale
subsetx= df[(df['Product1'] == "Apples1") & (df['Sale1'].notnull())]
print(subsetx)
Name1 Product1 Sale1
0 jack1 Apples1 341
3 Sonia1 Apples1 321
5 Mike1 Apples1 351
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND Sale = 351
subsetx= df[(df['Product1'] == "Apples1") & (df['Sale1'] == 351)]
print(subsetx)
Name1 Product1 Sale1
5 Mike1 Apples1 351
# Another example
subsetData = df[df['Product1'].isin(['Mangos1', 'Grapes1']) ]
print(subsetData)
Name1 Product1 Sale1
1 Riti1 Mangos1 311
2 Aadi1 Grapes1 301
4 Lucy1 Mangos1 331
Here is the Original link I found this. I edit it a little bit — https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/
All the considerations made by @EdChum in 2014 are still valid, but the pandas.Dataframe.ix
method is deprecated from the version 0.0.20 of pandas. Directly from the docs:
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor
of the more strict .iloc and .loc indexers.
In subsequent versions of pandas, this method has been replaced by new indexing methods pandas.Dataframe.loc and pandas.Dataframe.iloc.
If you want to learn more, in this post you can find comparisons between the methods mentioned above.
Ultimately, to date (and there does not seem to be any change in the upcoming versions of pandas from this point of view), the answer to this question is as follows:
foo = df.loc[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]
the query() method can do that very intuitively. Express your condition in a string to be evaluated like the following example :
df = df.query("columnNameA <= @x or columnNameB == @y")
with x and y are declared variables which you can refer to with @
I have a pandas df and would like to accomplish something along these lines (in SQL terms):
SELECT * FROM df WHERE column1 = 'a' OR column2 = 'b' OR column3 = 'c' etc.
Now this works, for one column/value pair:
foo = df.loc[df['column']==value]
However, I’m not sure how to expand that to multiple column/value pairs.
- To be clear, each column matches a different value.
You need to enclose multiple conditions in braces due to operator precedence and use the bitwise and (&
) and or (|
) operators:
foo = df[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]
If you use and
or or
, then pandas is likely to moan that the comparison is ambiguous. In that case, it is unclear whether we are comparing every value in a series in the condition, and what does it mean if only 1 or all but 1 match the condition. That is why you should use the bitwise operators or the numpy np.all
or np.any
to specify the matching criteria.
There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html
but there are some limitations mainly to do with issues where there could be ambiguity between column names and index values.
A more concise–but not necessarily faster–method is to use DataFrame.isin()
and DataFrame.any()
In [27]: n = 10
In [28]: df = DataFrame(randint(4, size=(n, 2)), columns=list('ab'))
In [29]: df
Out[29]:
a b
0 0 0
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
7 3 0
8 1 1
9 2 2
[10 rows x 2 columns]
In [30]: df.isin([1, 2])
Out[30]:
a b
0 False False
1 True True
2 True True
3 True False
4 True False
5 False True
6 True True
7 False False
8 True True
9 True True
[10 rows x 2 columns]
In [31]: df.isin([1, 2]).any(1)
Out[31]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
In [32]: df.loc[df.isin([1, 2]).any(1)]
Out[32]:
a b
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
8 1 1
9 2 2
[8 rows x 2 columns]
Easiest way to do this
if this helpful hit up arrow! Tahnks!!
students = [ ('jack1', 'Apples1' , 341) ,
('Riti1', 'Mangos1' , 311) ,
('Aadi1', 'Grapes1' , 301) ,
('Sonia1', 'Apples1', 321) ,
('Lucy1', 'Mangos1' , 331) ,
('Mike1', 'Apples1' , 351),
('Mik', 'Apples1' , np.nan)
]
#Create a DataFrame object
df = pd.DataFrame(students, columns = ['Name1' , 'Product1', 'Sale1'])
print(df)
Name1 Product1 Sale1
0 jack1 Apples1 341
1 Riti1 Mangos1 311
2 Aadi1 Grapes1 301
3 Sonia1 Apples1 321
4 Lucy1 Mangos1 331
5 Mike1 Apples1 351
6 Mik Apples1 NaN
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’,
subset = df[df['Product1'] == 'Apples1']
print(subset)
Name1 Product1 Sale1
0 jack1 Apples1 341
3 Sonia1 Apples1 321
5 Mike1 Apples1 351
6 Mik Apples1 NA
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND notnull value in Sale
subsetx= df[(df['Product1'] == "Apples1") & (df['Sale1'].notnull())]
print(subsetx)
Name1 Product1 Sale1
0 jack1 Apples1 341
3 Sonia1 Apples1 321
5 Mike1 Apples1 351
# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND Sale = 351
subsetx= df[(df['Product1'] == "Apples1") & (df['Sale1'] == 351)]
print(subsetx)
Name1 Product1 Sale1
5 Mike1 Apples1 351
# Another example
subsetData = df[df['Product1'].isin(['Mangos1', 'Grapes1']) ]
print(subsetData)
Name1 Product1 Sale1
1 Riti1 Mangos1 311
2 Aadi1 Grapes1 301
4 Lucy1 Mangos1 331
Here is the Original link I found this. I edit it a little bit — https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/
All the considerations made by @EdChum in 2014 are still valid, but the pandas.Dataframe.ix
method is deprecated from the version 0.0.20 of pandas. Directly from the docs:
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor
of the more strict .iloc and .loc indexers.
In subsequent versions of pandas, this method has been replaced by new indexing methods pandas.Dataframe.loc and pandas.Dataframe.iloc.
If you want to learn more, in this post you can find comparisons between the methods mentioned above.
Ultimately, to date (and there does not seem to be any change in the upcoming versions of pandas from this point of view), the answer to this question is as follows:
foo = df.loc[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]
the query() method can do that very intuitively. Express your condition in a string to be evaluated like the following example :
df = df.query("columnNameA <= @x or columnNameB == @y")
with x and y are declared variables which you can refer to with @