Remove non-numeric rows in one column with pandas
Question:
There is a dataframe like the following, and it has one unclean column ‘id’ which should be a numeric column
id, name
1, A
2, B
3, C
tt, D
4, E
5, F
de, G
Is there a concise way to remove these rows, since tt and de are not numeric values,
tt,D
de,G
to make the dataframe clean?
id, name
1, A
2, B
3, C
4, E
5, F
Answers:
Given that df is your dataframe:
import numpy as np
df[df['id'].apply(lambda x: isinstance(x, (int, np.int64)))]
This passes each value in the id column to isinstance and checks whether it is an int. The result is a boolean Series, which is then used to keep only the rows where the value is True.
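To see the intermediate boolean mask this produces, here is a minimal sketch (assuming the id column actually holds Python ints mixed with strings, which is what pandas stores for a mixed object column):

```python
import numpy as np
import pandas as pd

# Mixed-type column: some ints, one non-numeric string
df = pd.DataFrame({'id': [1, 2, 'tt', 3], 'name': ['A', 'B', 'C', 'D']})

# The boolean mask produced by the isinstance check
mask = df['id'].apply(lambda x: isinstance(x, (int, np.int64)))
print(mask.tolist())  # [True, True, False, True]

# Indexing with the mask keeps only the rows where it is True
clean = df[mask]
print(clean['id'].tolist())  # [1, 2, 3]
```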
If you also need to account for float values, another option is:
import numpy as np
df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
Note that neither approach works in place, so you will need to reassign the result to your original df, or create a new one:
df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
# or
new_df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
You could use the standard string method isnumeric and apply it to each value in your id column:
import pandas as pd
from io import StringIO
data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""
df = pd.read_csv(StringIO(data))
In [55]: df
Out[55]:
id name
0 1 A
1 2 B
2 3 C
3 tt D
4 4 E
5 5 F
6 de G
In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]:
id name
0 1 A
1 2 B
2 3 C
4 4 E
5 5 F
Or if you want to use id as the index, you could do:
In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]:
name
id
1 A
2 B
3 C
4 E
5 F
Edit: added timings.
Although the pd.to_numeric approach does not use the apply method, it is almost twice as slow as applying x.isnumeric() to a string column. I also added an option using the pandas str.isnumeric accessor, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general, because it works with any data type (not only strings).
In [3]: df_big = pd.concat([df]*10000)
In [4]: df_big.shape
Out[4]: (70000, 2)
In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using pd.to_numeric
In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
id name
0 1 A
1 2 B
2 3 C
4 4 E
5 5 F
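Note that after filtering this way, the id column still has its original object dtype. A sketch of converting it in the same pass, computing the coerced series only once (the variable names here are my own):

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2', 'tt', '4'], 'name': ['A', 'B', 'C', 'D']})

# Coerce once: non-numeric entries become NaN
coerced = pd.to_numeric(df['id'], errors='coerce')

# Keep the valid rows and store the numeric values back into id
clean = df[coerced.notnull()].assign(id=coerced.dropna())
print(clean['id'].tolist())  # [1.0, 2.0, 4.0]
```

The dtype comes out as float64, because the temporary NaN forces a float dtype before the invalid rows are dropped; add .astype(int) if you need integers.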
x.isnumeric() does not return True when x represents a float: isnumeric only matches digit characters, so a string like '1.5' fails the test (and calling it on an actual float raises an AttributeError).
One way to keep only the values which can be converted to float:
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):  # TypeError covers None
        return False
    return True

df[df['id'].apply(is_float)]
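A quick check of that helper's behavior; note the caveat that strings like 'nan' or 'inf' also parse as floats, so they slip through this filter:

```python
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):  # TypeError covers None
        return False
    return True

print(is_float('3.14'))  # True
print(is_float('tt'))    # False
print(is_float(None))    # False -- TypeError is caught
print(is_float('nan'))   # True -- float('nan') parses, so NaN strings pass
```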
How about this? The .str accessor is one of my favorites 🙂
import pandas as pd
df = pd.DataFrame(
{
'id': {0: '1', 1: '2', 2: '3', 3: 'tt', 4: '4', 5: '5', 6: 'de'},
'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'}
}
)
df_clean = df[df.id.str.isnumeric()]
Supplement (2021-06-22)
If the id column contains some kind of headache-makers (such as float, None, or nan), you can forcefully cast them to the str data type using astype('str').
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
'id': {0: '1', 1: '2', 2: '3', 3: 3.14, 4: '4', 5: '5', 6: None, 7: np.nan},
'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H'}
}
)
df_clean = df[df.id.astype('str').str.isnumeric()]
Primitive, but it works anyway.
This is a dynamic way to do it. It only works for int64 and float64; if you have other numeric data types in your dataframe, make sure you add them to the if statement.
# make a dataframe of column data types
col_types = df.dtypes.to_frame()
col_types.columns = ['dtype']
# make a list of zeros
drop_it = [0]*col_types.shape[0]
k = 0
# mark a column with a one if its data isn't numeric
# if you have other numeric types you need to add them to the if statement
for t in col_types.dtype:
    if t != 'int64' and t != 'float64':
        drop_it[k] = 1
    k = k + 1
# keep only the columns flagged as non-numeric in the drop list
col_types['drop_it'] = drop_it
col_types = col_types.loc[col_types["drop_it"] == 1]
# finally drop the columns that are in the drop list
for col_to_drop in col_types.index.values.tolist():
    df = df.drop([col_to_drop], axis=1)
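As an aside, the same column-dropping logic can be written in one call with select_dtypes. A sketch (this keeps all numeric dtypes, not just int64 and float64, which may or may not be what you want):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],   # object column, will be dropped
    'score': [0.5, 0.7, 0.9],
})

# Keep only columns with a numeric dtype
numeric_df = df.select_dtypes(include='number')
print(list(numeric_df.columns))  # ['id', 'score']
```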
Another alternative is to use the query method:
In [5]: df.query('id.str.isnumeric()')
Out[5]:
id name
0 1 A
1 2 B
2 3 C
4 4 E
5 5 F