How to select all columns whose names start with X in a pandas DataFrame
Question:
I have a DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})
I want to select the rows that have a value of 1 in the columns whose names start with foo. Is there a better way to do it than:
df2 = df[(df['foo.aa'] == 1) |
         (df['foo.fighters'] == 1) |
         (df['foo.bars'] == 1) |
         (df['foo.fox'] == 1) |
         (df['foo.manchu'] == 1)
         ]
Something along the lines of:
df2= df[df.STARTS_WITH_FOO == 1]
The answer should print out a DataFrame like this:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2          NA       NA
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
[4 rows x 7 columns]
Answers:
Just perform a list comprehension to create your columns:
In [28]:
filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:
df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
3     4.7         0             0        0           0
4     5.6         0             0        0           0
5     6.8         1             0        5           0
Another method is to create a Series from the columns and use the vectorised str method startswith:
In [33]:
df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
3     4.7         0             0        0           0
4     5.6         0             0        0           0
5     6.8         1             0        5           0
To achieve what you want, add the following to mask out the values that don’t meet your == 1 criterion:
In [36]:
df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      NaN       1       NaN           NaN      NaN         NaN      NaN
1      NaN     NaN       NaN             1      NaN         NaN      NaN
2      NaN     NaN       NaN           NaN        1         NaN      NaN
3      NaN     NaN       NaN           NaN      NaN         NaN      NaN
4      NaN     NaN       NaN           NaN      NaN         NaN      NaN
5      NaN     NaN         1           NaN      NaN         NaN      NaN
EDIT
OK, after seeing what you want, the convoluted answer is this:
In [72]:
df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2          NA       NA
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
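A less convoluted way to get the same rows is to build the column mask once, compare only those columns with the target value, and keep rows with at least one match. This is a minimal sketch on the question's data; the helper name rows_where_prefix_equals is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
    'foo.fighters': [0, 1, np.nan, 0, 0, 0],
    'foo.bars': [0, 0, 0, 0, 0, 1],
    'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
    'foo.fox': [2, 4, 1, 0, 0, 5],
    'nas.foo': ['NA', 0, 1, 0, 0, 0],
    'foo.manchu': ['NA', 0, 0, 0, 0, 0],
})


def rows_where_prefix_equals(frame, prefix, value):
    """Hypothetical helper: rows where any `prefix` column equals `value`."""
    # Boolean array marking the columns whose name starts with the prefix.
    col_mask = frame.columns.str.startswith(prefix)
    # Compare only those columns, then keep rows with at least one match.
    row_mask = (frame.loc[:, col_mask] == value).any(axis=1)
    return frame[row_mask]


df2 = rows_where_prefix_equals(df, 'foo', 1)
print(df2.index.tolist())  # [0, 1, 2, 5]
```

All seven columns are returned for the matching rows, which is the shape the question asks for.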
Now that pandas’ indexes support string operations, arguably the simplest and best way to select columns beginning with ‘foo’ is just:
df.loc[:, df.columns.str.startswith('foo')]
Alternatively, you can filter column (or row) labels with df.filter(). For example, to specify a regular expression matching the names that begin with foo. (the unescaped . matches any character; write r'^foo\.' to require a literal dot):
>>> df.filter(regex=r'^foo.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
3     4.7         0             0        0           0
4     5.6         0             0        0           0
5     6.8         1             0        5           0
To select only the required rows (those containing a 1) together with these columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any (here applied across all columns):
>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
5     6.8         1             0        5           0
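One caveat worth illustrating: (df == 1).any(axis=1) looks at every column, so a 1 in a non-foo column also keeps the row. With the question's data the result happens to be the same, but the sketch below uses a small made-up frame (demo and its column names are assumptions for illustration) where the two masks differ.

```python
import pandas as pd

# Hypothetical frame: row 1 has a 1 only in a non-foo column.
demo = pd.DataFrame({
    'foo.a': [1, 0, 0],
    'foo.b': [0, 0, 0],
    'bar.c': [0, 1, 0],
})

# Labels of the columns whose name starts with the literal "foo."
foo_cols = demo.filter(regex=r'^foo\.').columns

# Mask built from every column: row 1 is kept because of bar.c.
any_col = demo.loc[(demo == 1).any(axis=1), foo_cols]

# Mask built from the foo columns only: row 1 is dropped.
foo_only = demo.loc[(demo[foo_cols] == 1).any(axis=1), foo_cols]

print(any_col.index.tolist())   # [0, 1]
print(foo_only.index.tolist())  # [0]
```

If the condition should apply only to the foo columns, the second form is probably what you want.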
My solution, which may be slower:
a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2          NA       NA
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
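Note that concatenating one filtered frame per column repeats a row whenever it has a 1 in more than one foo column. That does not happen in the question's data, so the sketch below uses a made-up frame (demo is an assumption for illustration) and drops the repeated index labels afterwards.

```python
import pandas as pd

# Hypothetical frame: row 0 has a 1 in two different foo columns.
demo = pd.DataFrame({
    'foo.a': [1, 0, 1],
    'foo.b': [1, 0, 0],
    'bar.c': [9, 9, 9],
})
foo_cols = [c for c in demo.columns if c.startswith('foo')]

# One filtered frame per foo column, stacked and sorted by index.
stacked = pd.concat([demo[demo[c] == 1] for c in foo_cols]).sort_index()
print(stacked.index.tolist())  # [0, 0, 2]

# Keep only the first occurrence of each index label.
deduped = stacked[~stacked.index.duplicated()]
print(deduped.index.tolist())  # [0, 2]
```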
Another option for selecting the desired entries is to use map:
df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]
which gives you the foo columns for the rows that contain a 1:
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
5     6.8         1             0        5           0
The row selection is done by
(df == 1).any(axis=1)
as in @ajcr’s answer, which gives you:
0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool
meaning that rows 3 and 4 do not contain a 1 and won’t be selected.
The selection of the columns is done using Boolean indexing like this:
df.columns.map(lambda x: x.startswith('foo'))
In the example above this returns
array([False, True, True, True, True, True, False], dtype=bool)
So, if a column does not start with foo, False is returned and the column is therefore not selected.
If you just want to return all rows that contain a 1 – as your desired output suggests – you can simply do
df.loc[(df == 1).any(axis=1)]
which returns
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2          NA       NA
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
Based on @EdChum’s answer, you can try the following solution:
df[df.columns[pd.Series(df.columns).str.contains("foo")]]
This is helpful when not all the columns you want to select start with foo. This method selects all the columns whose names contain the substring foo, which can appear at any position in the name.
In essence, I replaced .startswith() with .contains().
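To make the difference concrete, here is a small sketch on a handful of the question's column names. It uses the string accessor on the Index directly; regex=False is passed so that foo is treated as a plain substring.

```python
import pandas as pd

cols = pd.Index(['foo.aa', 'bar.baz', 'nas.foo', 'foo.fox'])

# Prefix match: only names that begin with "foo".
starts = cols[cols.str.startswith('foo')]

# Substring match: "foo" may appear anywhere in the name.
contains = cols[cols.str.contains('foo', regex=False)]

print(list(starts))    # ['foo.aa', 'foo.fox']
print(list(contains))  # ['foo.aa', 'nas.foo', 'foo.fox']
```

So with .contains() the nas.foo column from the question would be selected as well.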
The simplest way is to use str directly on the column names; there is no need for pd.Series:
df.loc[:,df.columns.str.startswith("foo")]
You can use a regex here to pick out the columns starting with “foo”:
df.filter(regex='^foo')
If you only need the string foo to appear somewhere in the column name, then
df.filter(regex='foo')
would be appropriate (the pattern is searched for within each name, so no trailing * is needed).
For the next step, you can use
df[(df.filter(regex='^foo') == 1).any(axis=1)]
to keep the rows where at least one of the foo columns has the value 1.
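One thing to watch with regex patterns here: a trailing * applies only to the character before it, so a pattern such as '^foo*' means "fo followed by zero or more o" and also matches names that merely start with fo. A quick sketch with made-up column names (demo is an assumption for illustration):

```python
import pandas as pd

# Hypothetical one-row frame; only the column names matter here.
demo = pd.DataFrame([[1, 2, 3, 4]],
                    columns=['foo.x', 'fo.y', 'nas.foo', 'bar'])

# '^foo*' is looser than it looks: it also accepts "fo.y".
loose = demo.filter(regex='^foo*').columns.tolist()

# '^foo' requires the full prefix.
strict = demo.filter(regex='^foo').columns.tolist()

print(loose)   # ['foo.x', 'fo.y']
print(strict)  # ['foo.x']
```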
In my case I needed a list of prefixes:
colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]
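As a runnable sketch of the same idea, assuming a small made-up frame dc (its column names are invented for illustration): the prefixes are converted to a tuple and passed either to the str accessor or to Python's built-in str.startswith, which has always accepted a tuple.

```python
import pandas as pd

# Hypothetical frame standing in for the real `dc`.
dc = pd.DataFrame({
    'production.cpu': [1, 2],
    'test.cpu': [3, 4],
    'staging.cpu': [5, 6],
    'development.cpu': [7, 8],
})
cols_to_scale = ['production', 'test', 'development']
prefixes = tuple(cols_to_scale)

# Vectorised accessor on the column Index.
via_accessor = dc.loc[:, dc.columns.str.startswith(prefixes)]

# Plain Python alternative using the built-in str.startswith.
via_builtin = dc[[c for c in dc.columns if c.startswith(prefixes)]]

print(via_accessor.columns.tolist())
# ['production.cpu', 'test.cpu', 'development.cpu']
```

Both selections should give the same columns; the list comprehension does not depend on the accessor's tuple support.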
I do not like that other solutions require us to refer to the DataFrame twice; it might be fine if you have only one frame named df, but this is often not the case (and your actual name might be much longer). Let’s abuse pandas indexing capabilities to type less and make the code more readable. There is nothing stopping us from using something like this:
df.loc[:, columns.startswith('foo')]
This works because the indexer can be any Callable. We can even assign this pseudo-indexer to a variable and use it for multiple frames:
foo_columns = columns.startswith('foo')
df_1.loc[:, foo_columns]
df_2.loc[:, foo_columns]
We can even make it pretty-print:
> foo_columns
<function __main__.PandasIndexer:columns.str.startswith(pat='foo')()>
And we can use any other method of the str accessor, e.g. columns.contains(r'bard', regex=True), all while getting useful signatures:
> columns.contains
<function __main__.PandasIndexer:columns.str.contains(pat, case=True, flags=0, na=None, regex=True)>
All with this short magic code:
from pandas import Series
from inspect import signature, Signature

class PandasIndexer:

    def __init__(self, axis_name, accessor='str'):
        """
        Args:
            - axis_name: `columns` or `index`
            - accessor: e.g. `str`, or `dt`
        """
        self._axis_name = axis_name
        self._accessor = accessor
        self._dummy_series = Series(dtype=object)

    def _create_indexer(self, attribute):
        dummy_accessor = getattr(self._dummy_series, self._accessor)
        dummy_attr = getattr(dummy_accessor, attribute)
        name = f'PandasIndexer:{self._axis_name}.{self._accessor}.{attribute}'

        def indexer_factory(*args, **kwargs):
            def indexer(df):
                axis = getattr(df, self._axis_name)
                accessor = getattr(axis, self._accessor)
                method = getattr(accessor, attribute)
                return method(*args, **kwargs)

            bound_arguments = signature(dummy_attr).bind(*args, **kwargs)
            indexer.__qualname__ = (
                name + str(bound_arguments).replace('<BoundArguments ', '')[:-1]
            )
            indexer.__signature__ = Signature()
            return indexer

        indexer_factory.__name__ = name
        indexer_factory.__qualname__ = name
        indexer_factory.__signature__ = signature(dummy_attr)
        return indexer_factory

    def __getattr__(self, attribute):
        return self._create_indexer(attribute)

    def __dir__(self):
        """Make it work with auto-complete in IPython"""
        return dir(getattr(self._dummy_series, self._accessor))

columns = PandasIndexer('columns')
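If the pretty-printing and signatures are not needed, the underlying mechanism can be sketched with an ordinary function, since loc accepts a callable that receives the frame and returns a valid indexer. The frames df_1 and df_2 and the function name foo_columns below are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with different column layouts.
df_1 = pd.DataFrame({'foo.a': [1, 2], 'bar.b': [3, 4]})
df_2 = pd.DataFrame({'baz.c': [5], 'foo.d': [6], 'foo.e': [7]})


def foo_columns(frame):
    # Called by .loc with the frame being indexed; returns a boolean mask.
    return frame.columns.str.startswith('foo')


# The same callable is reused without naming either frame twice.
print(df_1.loc[:, foo_columns].columns.tolist())  # ['foo.a']
print(df_2.loc[:, foo_columns].columns.tolist())  # ['foo.d', 'foo.e']
```

The PandasIndexer class essentially generates such callables on demand for any accessor method.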
You can use the filter method with the like parameter, which keeps labels that contain the given substring anywhere in the name:
df.filter(like='foo')
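Because like is a substring test rather than a prefix test, it can pick up more columns than intended. A short sketch with three of the question's column names (the one-row frame demo is an assumption for illustration), next to an anchored regex for comparison:

```python
import pandas as pd

demo = pd.DataFrame({'foo.aa': [1], 'bar.baz': [2], 'nas.foo': [3]})

# like='foo' keeps every label containing "foo", including nas.foo.
like_cols = demo.filter(like='foo').columns.tolist()

# An anchored regex keeps only labels that start with "foo".
regex_cols = demo.filter(regex='^foo').columns.tolist()

print(like_cols)   # ['foo.aa', 'nas.foo']
print(regex_cols)  # ['foo.aa']
```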
You can also use this approach for multiple prefixes:
temp = df.loc[:, df.columns.str.startswith(('prefix1','prefix2','prefix3'))]