How to check if a pandas dataframe contains only numeric values column-wise?
Question:
I want to check every column in a dataframe whether it contains only numeric data. Specifically, my query is not about the datatype, but instead, I want to check every value in each column of the dataframe whether it’s a numeric value.
How can I find this out?
Answers:
Let’s say you have a dataframe called df. If you do:
df.select_dtypes(include=["float", 'int'])
This will return all the numeric columns; you can check whether this is the same as the original df.
Otherwise, you can also use the exclude parameter:
df.select_dtypes(exclude=["float", 'int'])
and check if this gives you an empty dataframe.
The following expression returns True if all columns are numeric, False otherwise (assuming numpy is imported as np):
df.shape[1] == df.select_dtypes(include=np.number).shape[1]
To select numeric columns:
new_df = df.select_dtypes(include=np.number)
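Putting the snippets above together into a small runnable sketch (the frame and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: two numeric columns and one string column
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# True only if every column has a numeric dtype
all_numeric = df.shape[1] == df.select_dtypes(include=np.number).shape[1]
print(all_numeric)  # False: "c" is an object column

# Keep just the numeric columns
new_df = df.select_dtypes(include=np.number)
print(list(new_df.columns))  # ['a', 'b']
```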
You can get a per-value True/False comparison using str.isnumeric().
Example:
>>> df
A B
0 1 1
1 NaN 6
2 NaN NaN
3 2 2
4 NaN NaN
5 4 4
6 some some
7 value other
Results:
>>> df.A.str.isnumeric()
0 True
1 NaN
2 NaN
3 True
4 NaN
5 True
6 False
7 False
Name: A, dtype: object
# df.B.str.isnumeric() gives the analogous result for column B
The apply() method is more robust if you need the check across every column at once. Take a test DataFrame with two columns, one mixed-type and one numbers-only:
>>> df
A B
0 1 1
1 NaN 6
2 NaN 33
3 2 2
4 NaN 22
5 4 4
6 some 66
7 value 11
Result:
>>> df.apply(lambda x: x.str.isnumeric())
A B
0 True True
1 NaN True
2 NaN True
3 True True
4 NaN True
5 True True
6 False True
7 False True
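One caveat: the .str accessor only works on string (object) columns, so this apply raises an AttributeError on a column with a numeric dtype, and str.isnumeric() also returns False for strings like "-22.0". A sketch that guards for numeric dtypes (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": ["1", None, "2", "some"], "B": [1, 6, 2, 4]})

def col_isnumeric(s: pd.Series) -> pd.Series:
    # Numeric dtypes have no .str accessor; treat those columns as all numeric
    if pd.api.types.is_numeric_dtype(s):
        return pd.Series(True, index=s.index)
    return s.str.isnumeric()

result = df.apply(col_isnumeric)
print(result)
```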
Another example:
Consider the dataframe below with different data types:
>>> df
num rating name age
0 0 80.0 shakir 33
1 1 -22.0 rafiq 37
2 2 -10.0 dev 36
3 num 1.0 suraj 30
Per the OP’s comment on this answer, the data contains negative values and zeros (which str.isnumeric() would misclassify).
1. _get_numeric_data() is a pseudo-internal method that returns only the numeric-type data:
>>> df._get_numeric_data()
rating age
0 80.0 33
1 -22.0 37
2 -10.0 36
3 1.0 30
2. Alternatively, use the select_dtypes method (in module pandas.core.frame), which returns a subset of the DataFrame’s columns based on the column dtypes; it takes include and exclude parameters.
>>> df.select_dtypes(include=['int64','float64']) # choosing int & float
rating age
0 80.0 33
1 -22.0 37
2 -10.0 36
3 1.0 30
>>> df.select_dtypes(include=['int64']) # choose int
age
0 33
1 37
2 36
3 30
You can check this using to_numeric and coercing errors to NaN:
pd.to_numeric(df['column'], errors='coerce').notnull().all()
For all columns, iterate through them or simply use apply:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
E.g.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col' : [1, 2, 10, np.nan, 'a'],
                   'col2': ['a', 10, 30, 40, 50],
                   'col3': [1, 2, 3, 4, 5.0]})
This outputs:
col False
col2 False
col3 True
dtype: bool
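Building on that boolean Series, you can pull out just the fully-numeric column names (a small sketch using the same example frame; note that NaN makes a column fail this check, so drop NaNs first if they should be allowed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col' : [1, 2, 10, np.nan, 'a'],
                   'col2': ['a', 10, 30, 40, 50],
                   'col3': [1, 2, 3, 4, 5.0]})

# True per column iff every value survives numeric coercion
mask = df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
numeric_cols = df.columns[mask].tolist()
print(numeric_cols)  # ['col3']
```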
The accepted answers seem a bit overkill, as they sub-select the entire dataframe. To check types, only the metadata is needed, which pd.api.types.is_numeric_dtype provides.
import pandas as pd

df = pd.DataFrame(data=[[1, 'a']], columns=['numeric_col', 'string_col'])
print(df.columns[list(map(pd.api.types.is_numeric_dtype, df.dtypes))])  # one way
print(df.dtypes.map(pd.api.types.is_numeric_dtype))  # another way
To check for numeric columns, you could use df[c].dtype.kind in 'iufcb', where c is any given column name. The comparison yields a True or False boolean output.
You can iterate over all the column names with a list comprehension:
>>> [(c, df[c].dtype.kind in 'iufcb') for c in df.columns]
[('col', False), ('col2', False), ('col3', True)]
The numpy.dtype.kind string 'iufcb' covers signed integer (i), unsigned integer (u), float (f), complex number (c), and boolean (b). The string can be modified to exclude any of these (e.g., 'iufc' to exclude booleans).
This solves the original question in relation to checking column data types. It also provides the benefits of (1) a shorter line of code which (2) remains sufficiently intuitive to the user.
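If you just want a single yes/no for the whole frame, the same kind check collapses with all() (a minimal sketch with invented data):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5], "z": ["a", "b"]})

# True only if every column's dtype kind is int/uint/float/complex/bool
all_numeric = all(df[c].dtype.kind in 'iufcb' for c in df.columns)
print(all_numeric)  # False: "z" has kind 'O'
```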