Pandas Select DataFrame columns using boolean
Question:
I want to use a boolean to select the columns with more than 4000 entries from a DataFrame comb, which has over 1,000 columns. This expression gives me a Boolean (True/False) result:
criteria = comb.ix[:,'c_0327':].count()>4000
I want to use it to select only the True columns into a new DataFrame.
The following just gives me “Unalignable boolean Series key provided”:
comb.loc[criteria,]
I also tried:
comb.ix[:, comb.ix[:,'c_0327':].count()>4000]
This is similar to the answer to the question "dataframe boolean selection along columns instead of row", but it gives me the same error: “Unalignable boolean Series key provided”
comb.ix[:,'c_0327':].count()>4000
yields:
c_0327 False
c_0328 False
c_0329 False
c_0330 False
c_0331 False
c_0332 False
c_0333 False
c_0334 False
c_0335 False
c_0336 False
c_0337 True
c_0338 False
.....
Answers:
What is returned is a Series with the column names as the index and the booleans as the values.
I think what you actually want is the following, which should work:
comb[criteria.index[criteria]]
This masks the index of criteria with its own boolean values, which returns the column names that are True. Those names can then be used to select the columns of interest from the original df.
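As a rough illustration of the masking step above, here is a minimal sketch on a toy frame. The data, the extra "id" column, and the threshold of 2 (standing in for 4000) are made up for the example, and .loc replaces the .ix used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for comb; the real frame has over 1,000 columns.
comb = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "c_0327": [1.0, np.nan, np.nan, np.nan],
    "c_0328": [1.0, 2.0, 3.0, np.nan],
    "c_0329": [1.0, 2.0, 3.0, 4.0],
})

# Count non-null entries per column, starting at c_0327 (threshold 2 instead of 4000).
criteria = comb.loc[:, "c_0327":].count() > 2

# Mask the Series index with its own boolean values to get the True labels.
selected_columns = criteria.index[criteria]

# Selecting by label works even though criteria does not cover the "id" column.
result = comb[selected_columns]
print(list(result.columns))
```

Because the selection is done by column label, it does not matter that criteria was built from only part of the frame.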
You can also use:
# To filter columns (assuming criteria length is equal to the number of columns of comb)
# .ix was removed in pandas 1.0, so .loc is used here instead
comb.loc[:, criteria]
comb.iloc[:, criteria.values]
# To filter rows (assuming criteria length is equal to the number of rows of comb)
comb[criteria]
I’m using this, which is cleaner, although it returns a NumPy array rather than a DataFrame:
comb.values[:,criteria]
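To make the difference concrete, the sketch below compares indexing through .values with indexing through .loc. The frame and the mask are invented for the example, and the mask is assumed to cover every column.

```python
import pandas as pd

# Hypothetical three-column frame and a mask with one entry per column.
comb = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
criteria = pd.Series([True, False, True], index=comb.columns)

# Indexing the underlying array drops the row index and the column names.
as_array = comb.values[:, criteria.values]

# Indexing with .loc keeps the result as a labelled DataFrame.
as_frame = comb.loc[:, criteria]

print(type(as_array).__name__, as_array.shape)
print(list(as_frame.columns))
```

The array form may be enough when only the numbers are needed, but the labels have to be tracked separately afterwards.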
In pandas 0.25:
comb.loc[:, criteria]
Returns a DataFrame with columns selected by the Boolean list or Series.
For multiple criteria:
comb.loc[:, criteria1 & criteria2]
And for selecting rows with an index criteria:
comb[criteria]
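The loc selection above expects the boolean Series to have an entry for every column of comb. A mask built from a slice of the frame, as in the question, lacks entries for the earlier columns, which is a likely cause of the "Unalignable" error. One possible workaround, sketched here with made-up data and a threshold of 2, is to reindex the mask against comb.columns first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column ("id") that the mask will not cover.
comb = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "c_0327": [1.0, np.nan, np.nan, np.nan],
    "c_0328": [1.0, 2.0, 3.0, 4.0],
})

# Built from a slice, so there is no entry for "id".
criteria = comb.loc[:, "c_0327":].count() > 2

# Extend the mask to every column, treating unmentioned columns as False.
full_mask = criteria.reindex(comb.columns, fill_value=False)

result = comb.loc[:, full_mask]
print(list(result.columns))
```

Using fill_value=True instead would keep the uncovered columns rather than drop them.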
Note:
The bit-wise operator & is required (not and). See Logical operators for boolean indexing in Pandas.
Other Note:
If the criteria is an expression (e.g., comb.columnX > 3), and multiple criteria are used, remember to enclose each expression in parentheses! This is because & and | have higher precedence than >, ==, etc. (whereas and and or have lower precedence).
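A small sketch of the parenthesization rule, using an invented frame with columns a, b, and c. It combines two row conditions and, separately, two column masks.

```python
import pandas as pd

# Hypothetical data for illustration only.
comb = pd.DataFrame({"a": [1, 5, 7], "b": [2, 2, 9], "c": [0, 1, 8]})

# Row selection: each comparison is wrapped in parentheses before combining with &.
rows = comb[(comb["a"] > 3) & (comb["b"] < 5)]

# Column selection: two per-column masks combined with &.
many_large = (comb > 3).sum() >= 2   # at least two values above 3
has_small = comb.min() < 2           # smallest value below 2
cols = comb.loc[:, many_large & has_small]

print(list(rows.index))
print(list(cols.columns))
```

Without the parentheses, comb["a"] > 3 & comb["b"] < 5 would evaluate 3 & comb["b"] first and then try to chain the two comparisons, which does not give the intended mask.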
Another solution is to transpose comb to make its columns act as its index, then transpose the resulting subset back:
comb.T[criteria].T
Again, not particularly elegant, but at least shorter/less repetitive than the leading solution.
You can pass a boolean array to loc to indicate which columns should be kept and which not.
For example,
>>> df
    A   B   C   D    E
0  73  15  55  33  foo
1  63  64  11  11  bar
2  56  72  57  55  foo
>>> df.loc[:, [True, True, False, False, True]]
    A   B    E
0  73  15  foo
1  63  64  bar
2  56  72  foo
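Another approach is to use Python’s built-in filter function:
def satisfies_criteria(column):
    # True when the column has more than 4000 non-null entries
    return comb[column].count() > 4000
# list() is needed on Python 3, where filter returns a lazy iterator
cols = list(filter(satisfies_criteria, comb.columns))
comb[cols]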