Find non-numeric values in pandas dataframe column
Question:
I have a column in a dataframe that contains numbers and strings. So I replaced the strings with numbers via df.column.replace(["A", "B", "C", "D"], [1, 2, 3, 4], inplace=True).
But the column is still dtype "object". I cannot sort the column (TypeError: '<' not supported between instances of 'str' and 'int').
Now how can I identify those numbers that are strings? I tried print(df[pd.to_numeric(df['column']).isnull()]) and it gives back an empty dataframe, as expected. However, I read that this does not work in my case (actual numbers saved as strings). So how can I identify those numbers saved as a string?
Am I right that if a column only contains REAL numbers (int or float) it will automatically change to dtype int or float?
Thank you!
Answers:
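You can change the dtype by converting the column and assigning the result back. Note that this raises a ValueError if any value cannot be parsed as an integer:
df['column'] = df['column'].astype(int)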
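To make the astype route concrete, here is a minimal sketch on made-up data (the values in the frame are assumptions for illustration, not the asker's data). It shows the conversion working when every value is integer-like, and failing when one is not.

```python
import pandas as pd

# Hypothetical sample: a mix of real numbers and numbers stored as strings.
df = pd.DataFrame({"column": [1, "2", 3.0, "4"]})

# The column holds mixed Python objects, so pandas reports dtype object.
original_dtype = df["column"].dtype

# astype(int) converts each element and returns a new Series;
# it has to be assigned back to take effect.
converted = df["column"].astype(int)

# A value that cannot be parsed as an integer makes the conversion fail.
bad = pd.Series([1, "2", "abc"], dtype=object)
try:
    bad.astype(int)
    failed = False
except ValueError:
    failed = True
```

So astype(int) is only safe once you know every value is convertible; otherwise pd.to_numeric with errors='coerce' is the more forgiving choice.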
You can use pd.to_numeric with something like:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
For the errors argument you have a few options; see the pd.to_numeric reference documentation.
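A short sketch of how the errors argument changes the result, using an invented Series (the sample values are assumptions). The default, errors='raise', stops at the first unparseable value, while errors='coerce' turns such values into NaN. This is also why the attempt in the question returns an empty dataframe: numeric strings such as "2" parse fine, so nothing ends up null.

```python
import pandas as pd

# Hypothetical sample: an int, two numeric strings and one real string.
s = pd.Series([1, "2", "3.5", "abc"], dtype=object)

# Default behaviour (errors='raise'): an unparseable value raises ValueError.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable values become NaN, everything else is parsed.
coerced = pd.to_numeric(s, errors="coerce")
```

Here coerced is a float column with NaN only in the position of "abc". As far as I know, a third option, errors='ignore', exists in older pandas releases but is deprecated in recent ones, so I left it out.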
Expanding on Francesco's answer, it's possible to create a mask of non-numeric values and identify the unique instances to handle or remove.
This uses the fact that, with errors='coerce', values that can't be converted become NaN and so show up as nulls.
is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
df[is_non_numeric]['column'].unique()
Alternatively, in a single line:
df[pd.to_numeric(df['column'], errors='coerce').isnull()]['column'].unique()
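The mask above finds values that cannot be parsed at all. If you also want the numbers that are merely stored as strings (what the question asks for), one possible approach is to combine that mask with a check on the Python type of each element. This is a sketch on made-up data; the names is_str, parsed, numeric_strings and non_numeric are only illustrative.

```python
import pandas as pd

# Hypothetical sample: real numbers, numbers stored as strings, one true string.
df = pd.DataFrame({"column": [1, "2", 3.5, "4", "abc"]})

# True where the stored Python object is a str, whatever its content.
is_str = df["column"].map(lambda v: isinstance(v, str))

# NaN where the value cannot be parsed as a number.
parsed = pd.to_numeric(df["column"], errors="coerce")

# Strings that parse cleanly: numbers saved as strings.
numeric_strings = df.loc[is_str & parsed.notna(), "column"]

# Values that do not parse: genuinely non-numeric entries.
non_numeric = df.loc[parsed.isna(), "column"]
```

On this sample numeric_strings holds "2" and "4", and non_numeric holds "abc". Note that a value that is already NaN in the original column would also land in non_numeric, so check for those separately if your data has missing values.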