Get list of pandas dataframe columns based on data type
Question:
If I have a dataframe with the following columns:
1. NAME object
2. On_Time object
3. On_Budget object
4. %actual_hr float64
5. Baseline Start Date datetime64[ns]
6. Forecast Start Date datetime64[ns]
I would like to be able to say: for this dataframe, give me a list of the columns which are of type ‘object’ or of type ‘datetime’?
I have a function which converts numbers (‘float64’) to two decimal places, and I would like to use this list of dataframe columns, of a particular type, and run it through this function to convert them all to 2dp.
Maybe something like:
For c in col_list: if c.dtype = "Something"
list[]
List.append(c)?
Answers:
You can use boolean mask on the dtypes attribute:
In [11]: df = pd.DataFrame([[1, 2.3456, 'c']])
In [12]: df.dtypes
Out[12]:
0 int64
1 float64
2 object
dtype: object
In [13]: msk = df.dtypes == np.float64 # or object, etc.
In [14]: msk
Out[14]:
0 False
1 True
2 False
dtype: bool
You can look at just those columns with the desired dtype:
In [15]: df.loc[:, msk]
Out[15]:
1
0 2.3456
Now you can use round (or whatever) and assign it back:
In [16]: np.round(df.loc[:, msk], 2)
Out[16]:
1
0 2.35
In [17]: df.loc[:, msk] = np.round(df.loc[:, msk], 2)
In [18]: df
Out[18]:
0 1 2
0 1 2.35 c
If you want a list of columns of a certain type, you can use groupby
:
>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df
A B C D E
0 1 2.3456 c d 78
[1 rows x 5 columns]
>>> df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
>>> g = df.columns.to_series().groupby(df.dtypes).groups
>>> g
{dtype('int64'): ['A', 'E'], dtype('float64'): ['B'], dtype('O'): ['C', 'D']}
>>> {k.name: v for k, v in g.items()}
{'object': ['C', 'D'], 'int64': ['A', 'E'], 'float64': ['B']}
As of pandas v0.14.1, you can utilize select_dtypes()
to select columns by dtype
In [2]: df = pd.DataFrame({'NAME': list('abcdef'),
'On_Time': [True, False] * 3,
'On_Budget': [False, True] * 3})
In [3]: df.select_dtypes(include=['bool'])
Out[3]:
On_Budget On_Time
0 False True
1 True False
2 False True
3 True False
4 False True
5 True False
In [4]: mylist = list(df.select_dtypes(include=['bool']).columns)
In [5]: mylist
Out[5]: ['On_Budget', 'On_Time']
If you want a list of only the object columns you could do:
non_numerics = [x for x in df.columns
if not (df[x].dtype == np.float64
or df[x].dtype == np.int64)]
and then if you want to get another list of only the numerics:
numerics = [x for x in df.columns if x not in non_numerics]
Using dtype
will give you desired column’s data type:
dataframe['column1'].dtype
if you want to know data types of all the column at once, you can use plural of dtype
as dtypes:
dataframe.dtypes
list(df.select_dtypes(['object']).columns)
This should do the trick
use df.info(verbose=True)
where df
is a pandas datafarme, by default verbose=False
The most direct way to get a list of columns of certain dtype e.g. ‘object’:
df.select_dtypes(include='object').columns
For example:
>>df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
To get all ‘object’ dtype columns:
>>df.select_dtypes(include='object').columns
Index(['C', 'D'], dtype='object')
For just the list:
>>list(df.select_dtypes(include='object').columns)
['C', 'D']
I came up with this three liner.
Essentially, here’s what it does:
- Fetch the column names and their respective data types.
- I am optionally outputting it to a csv.
inp = pd.read_csv('filename.csv') # read input. Add read_csv arguments as needed
columns = pd.DataFrame({'column_names': inp.columns, 'datatypes': inp.dtypes})
columns.to_csv(inp+'columns_list.csv', encoding='utf-8') # encoding is optional
This made my life much easier in trying to generate schemas on the fly. Hope this helps
for yoshiserry;
def col_types(x,pd):
dtypes=x.dtypes
dtypes_col=dtypes.index
dtypes_type=dtypes.value
column_types=dict(zip(dtypes_col,dtypes_type))
return column_types
I use infer_objects()
Docstring: Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object
and unconvertible columns unchanged. The inference rules are the same
as during normal Series/DataFrame construction.
df.infer_objects().dtypes
If after 6 years you still have the issue, this should solve it 🙂
cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]
df = pd.DataFrame({'float': [1.0],
'int': [1],
'bool_1': [False],
'datetime': [pd.Timestamp('20180310')],
'bool_2': [True],
'string': ['foo']})
df.dtypes
# float float64
# int int64
# bool_1 bool
# datetime datetime64[ns]
# bool_2 bool
# string object
# dtype: object
[column for column, is_type in (df.dtypes==bool).items() if is_type]
# ['bool_1', 'bool_2']
Many of the posted solutions use df.select_dtypes which unnecessarily creates a temporary intermediate dataframe. If all you want is "a list of the columns which are of" non-numeric (not float32/int64/complex128/etc.) types, just do one of these (remove the "not" if you do want just the numeric types):
import numpy as np
[c for c in df.columns if not np.issubdtype(df[c].dtype, np.number)]
from pandas.api.types import is_numeric_dtype
[c for c in df.columns if not is_numeric_dtype(c)]
Note: if you want to distinguish floating (float32/float64) from integer and complex then you could use np.floating instead of np.number in the first of the two solutions above or in the first of the two just below.
If you want the result to be a pd.Index rather than just a list of column name strings as above, here are two ways (first is based on @juanpa.arrivillaga):
import numpy as np
df.columns[[not np.issubdtype(dt, np.number) for dt in df.dtypes]]
from pandas.api.types import is_numeric_dtype
df.columns[[not is_numeric_dtype(c) for c in df.columns]]
Some other methods may consider a bool column to be numeric, but the solutions above do not (tested with numpy 1.22.3 / pandas 1.4.2).
If I have a dataframe with the following columns:
1. NAME object
2. On_Time object
3. On_Budget object
4. %actual_hr float64
5. Baseline Start Date datetime64[ns]
6. Forecast Start Date datetime64[ns]
I would like to be able to say: for this dataframe, give me a list of the columns which are of type ‘object’ or of type ‘datetime’?
I have a function which converts numbers (‘float64’) to two decimal places, and I would like to use this list of dataframe columns, of a particular type, and run it through this function to convert them all to 2dp.
Maybe something like:
For c in col_list: if c.dtype = "Something"
list[]
List.append(c)?
You can use boolean mask on the dtypes attribute:
In [11]: df = pd.DataFrame([[1, 2.3456, 'c']])
In [12]: df.dtypes
Out[12]:
0 int64
1 float64
2 object
dtype: object
In [13]: msk = df.dtypes == np.float64 # or object, etc.
In [14]: msk
Out[14]:
0 False
1 True
2 False
dtype: bool
You can look at just those columns with the desired dtype:
In [15]: df.loc[:, msk]
Out[15]:
1
0 2.3456
Now you can use round (or whatever) and assign it back:
In [16]: np.round(df.loc[:, msk], 2)
Out[16]:
1
0 2.35
In [17]: df.loc[:, msk] = np.round(df.loc[:, msk], 2)
In [18]: df
Out[18]:
0 1 2
0 1 2.35 c
If you want a list of columns of a certain type, you can use groupby
:
>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df
A B C D E
0 1 2.3456 c d 78
[1 rows x 5 columns]
>>> df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
>>> g = df.columns.to_series().groupby(df.dtypes).groups
>>> g
{dtype('int64'): ['A', 'E'], dtype('float64'): ['B'], dtype('O'): ['C', 'D']}
>>> {k.name: v for k, v in g.items()}
{'object': ['C', 'D'], 'int64': ['A', 'E'], 'float64': ['B']}
As of pandas v0.14.1, you can utilize select_dtypes()
to select columns by dtype
In [2]: df = pd.DataFrame({'NAME': list('abcdef'),
'On_Time': [True, False] * 3,
'On_Budget': [False, True] * 3})
In [3]: df.select_dtypes(include=['bool'])
Out[3]:
On_Budget On_Time
0 False True
1 True False
2 False True
3 True False
4 False True
5 True False
In [4]: mylist = list(df.select_dtypes(include=['bool']).columns)
In [5]: mylist
Out[5]: ['On_Budget', 'On_Time']
If you want a list of only the object columns you could do:
non_numerics = [x for x in df.columns
if not (df[x].dtype == np.float64
or df[x].dtype == np.int64)]
and then if you want to get another list of only the numerics:
numerics = [x for x in df.columns if x not in non_numerics]
Using dtype
will give you desired column’s data type:
dataframe['column1'].dtype
if you want to know data types of all the column at once, you can use plural of dtype
as dtypes:
dataframe.dtypes
list(df.select_dtypes(['object']).columns)
This should do the trick
use df.info(verbose=True)
where df
is a pandas datafarme, by default verbose=False
The most direct way to get a list of columns of certain dtype e.g. ‘object’:
df.select_dtypes(include='object').columns
For example:
>>df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
To get all ‘object’ dtype columns:
>>df.select_dtypes(include='object').columns
Index(['C', 'D'], dtype='object')
For just the list:
>>list(df.select_dtypes(include='object').columns)
['C', 'D']
I came up with this three liner.
Essentially, here’s what it does:
- Fetch the column names and their respective data types.
- I am optionally outputting it to a csv.
inp = pd.read_csv('filename.csv') # read input. Add read_csv arguments as needed
columns = pd.DataFrame({'column_names': inp.columns, 'datatypes': inp.dtypes})
columns.to_csv(inp+'columns_list.csv', encoding='utf-8') # encoding is optional
This made my life much easier in trying to generate schemas on the fly. Hope this helps
for yoshiserry;
def col_types(x,pd):
dtypes=x.dtypes
dtypes_col=dtypes.index
dtypes_type=dtypes.value
column_types=dict(zip(dtypes_col,dtypes_type))
return column_types
I use infer_objects()
Docstring: Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object
and unconvertible columns unchanged. The inference rules are the same
as during normal Series/DataFrame construction.
df.infer_objects().dtypes
If after 6 years you still have the issue, this should solve it 🙂
cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]
df = pd.DataFrame({'float': [1.0],
'int': [1],
'bool_1': [False],
'datetime': [pd.Timestamp('20180310')],
'bool_2': [True],
'string': ['foo']})
df.dtypes
# float float64
# int int64
# bool_1 bool
# datetime datetime64[ns]
# bool_2 bool
# string object
# dtype: object
[column for column, is_type in (df.dtypes==bool).items() if is_type]
# ['bool_1', 'bool_2']
Many of the posted solutions use df.select_dtypes which unnecessarily creates a temporary intermediate dataframe. If all you want is "a list of the columns which are of" non-numeric (not float32/int64/complex128/etc.) types, just do one of these (remove the "not" if you do want just the numeric types):
import numpy as np
[c for c in df.columns if not np.issubdtype(df[c].dtype, np.number)]
from pandas.api.types import is_numeric_dtype
[c for c in df.columns if not is_numeric_dtype(c)]
Note: if you want to distinguish floating (float32/float64) from integer and complex then you could use np.floating instead of np.number in the first of the two solutions above or in the first of the two just below.
If you want the result to be a pd.Index rather than just a list of column name strings as above, here are two ways (first is based on @juanpa.arrivillaga):
import numpy as np
df.columns[[not np.issubdtype(dt, np.number) for dt in df.dtypes]]
from pandas.api.types import is_numeric_dtype
df.columns[[not is_numeric_dtype(c) for c in df.columns]]
Some other methods may consider a bool column to be numeric, but the solutions above do not (tested with numpy 1.22.3 / pandas 1.4.2).