Select from pandas dataframe using boolean series/array
Question:
I have a dataframe:
             High    Low  Close
Date
2009-02-11  30.20  29.41  29.87
2009-02-12  30.28  29.32  30.24
2009-02-13  30.45  29.96  30.10
2009-02-17  29.35  28.74  28.90
2009-02-18  29.35  28.56  28.92
and a boolean series:
   bools
1   True
2  False
3  False
4   True
5  False
How can I select from the dataframe using the boolean series to obtain a result like:
             High
Date
2009-02-11  30.20
2009-02-17  29.35
Answers:
For boolean indexing with a Series to work, the Series and the DataFrame have to have comparable indexes. In this case it won't work, because the Series has an integer index while the DataFrame is indexed by dates.
However, as you say, you can filter using a bool array. You can access the underlying array of a Series via .values. This can then be applied as a filter as follows:
df # pandas.DataFrame
s # pandas.Series
df[s.values] # df, filtered by the bool array in s
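To illustrate the point about comparable indexes, here is a minimal sketch. It assumes a cut-down one-column version of the question's data, and the names direct_indexing_failed and s_aligned are made up for the illustration. Passing the Series itself fails because its labels 0..4 cannot be aligned with the date labels; rebuilding the Series on the DataFrame's index is one alternative to passing the raw array.

```python
import pandas as pd

# Cut-down version of the question's data (High column only), for illustration.
df = pd.DataFrame(
    {"High": [30.20, 30.28, 30.45, 29.35, 29.35]},
    index=["2009-02-11", "2009-02-12", "2009-02-13", "2009-02-17", "2009-02-18"],
)
s = pd.Series([True, False, False, True, False], name="bools")  # default index 0..4

# Passing the Series directly makes pandas try to align s's index with df's index.
try:
    df[s]
    direct_indexing_failed = False
except Exception:  # broad catch on purpose: the exact exception class is not assumed here
    direct_indexing_failed = True

# Rebuild the Series on df's index so the labels match, then index with it.
s_aligned = pd.Series(s.to_numpy(), index=df.index, name=s.name)
result = df[s_aligned]
print(result)
```

Whether to realign the Series or to pass the bare array is mostly a matter of taste; the bare array ignores labels entirely and relies on position only.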
For example, with your data:
import pandas as pd
df = pd.DataFrame(
    [
        [30.20, 29.41, 29.87],
        [30.28, 29.32, 30.24],
        [30.45, 29.96, 30.10],
        [29.35, 28.74, 28.90],
        [29.35, 28.56, 28.92],
    ],
    columns=['High', 'Low', 'Close'],
    index=['2009-02-11', '2009-02-12', '2009-02-13', '2009-02-17', '2009-02-18'],
)
s = pd.Series([True, False, False, True, False], name='bools')
df[s.values]
Returns the following:
             High    Low  Close
2009-02-11  30.20  29.41  29.87
2009-02-17  29.35  28.74  28.90
If you just want the High column, you can select it as normal (before or after applying the bool filter):
df['High'][s.values]
# Or: df[s.values]['High']
This gives your target output (as a Series):
2009-02-11 30.20
2009-02-17 29.35
Name: High, dtype: float64
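As an alternative to chaining the two selections, the row mask and the column label can be given together in a single .loc call. This is a sketch using the same data as above; the names mask, high and high_frame are only for illustration, and setting the index name to Date is an assumption made to match the layout shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [30.20, 29.41, 29.87],
        [30.28, 29.32, 30.24],
        [30.45, 29.96, 30.10],
        [29.35, 28.74, 28.90],
        [29.35, 28.56, 28.92],
    ],
    columns=["High", "Low", "Close"],
    index=["2009-02-11", "2009-02-12", "2009-02-13", "2009-02-17", "2009-02-18"],
)
df.index.name = "Date"  # assumption: matches the "Date" header in the question
s = pd.Series([True, False, False, True, False], name="bools")

mask = s.to_numpy()  # plain NumPy bool array, so no index alignment is attempted

high = df.loc[mask, "High"]          # Series: rows where mask is True, High column
high_frame = df.loc[mask, ["High"]]  # one-column DataFrame, as in the question's layout
print(high_frame)
```

Passing a list of column labels (["High"]) rather than a single label keeps the result a DataFrame, which is closer to the output shown in the question.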
This exact example, with exactly the same code, does not work in 2022 anymore…
df[s.values] returns all 5 rows, not 2.
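One way to narrow down a report like this is to inspect the mask before using it. The sketch below assumes the same df and s as in the answer (the variable names mask and selected are illustrative). With a bool-dtype NumPy array of the same length as the frame, the number of selected rows should equal the number of True values in the mask.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [30.20, 29.41, 29.87],
        [30.28, 29.32, 30.24],
        [30.45, 29.96, 30.10],
        [29.35, 28.74, 28.90],
        [29.35, 28.56, 28.92],
    ],
    columns=["High", "Low", "Close"],
    index=["2009-02-11", "2009-02-12", "2009-02-13", "2009-02-17", "2009-02-18"],
)
s = pd.Series([True, False, False, True, False], name="bools")

mask = s.to_numpy()

print(mask.dtype)            # dtype of the mask; a boolean filter needs bool here
print(len(mask) == len(df))  # the mask must have one entry per row
print(int(mask.sum()))       # number of True entries in the mask

selected = df[mask]
print(len(selected))         # number of rows actually selected
```

If the dtype is not bool, or the lengths differ, the mask was probably built differently from the one in the answer.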