Filter out nan rows in a specific column
Question:
df =
Col1 Col2 Col3
1 nan 4
2 5 4
3 3 nan
Given the dataframe df, I want to obtain a new dataframe df2 that does not contain nan in the column Col2. This is the expected result:
df2 =
Col1 Col2 Col3
2 5 4
3 3 nan
I know that it's possible to use pandas.isnull and dropna; however, how do I specify the particular column to which the filtering should be applied?
Answers:
Use dropna
:
df = df.dropna(subset=['Col2'])
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Another solution – boolean indexing
with notnull
:
df = df[df['Col2'].notnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Which is the same as:
df = df[~df['Col2'].isnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
You can use the DataFrame.dropna() method:
In [202]: df.dropna(subset=['Col2'])
Out[202]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
or the (in this case less idiomatic) Series.notnull():
In [204]: df.loc[df.Col2.notnull()]
Out[204]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
or using DataFrame.query() method:
In [205]: df.query("Col2 == Col2")
Out[205]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
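The query trick above works because NaN is the only value that is not equal to itself, so the expression "Col2 == Col2" evaluates to False exactly on the NaN rows. A minimal sketch (rebuilding the example frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

# NaN != NaN, so this comparison is False only for the NaN rows
print(np.nan == np.nan)            # False

# query keeps the rows where Col2 equals itself, i.e. is not NaN
df2 = df.query("Col2 == Col2")
print(df2)
```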
numexpr
solution:
In [241]: import numexpr as ne
In [242]: col = df.Col2
In [243]: df[ne.evaluate("col == col")]
Out[243]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Using numpy's isnan to mask and construct a new dataframe:
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
Col1 Col2 Col3
1 2.0 5.0 4.0
2 3.0 3.0 NaN
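Note that routing the data through df.values upcasts every column to a common dtype (which is why Col1 shows 2.0 and 3.0 above). If you want to keep the original dtypes, a sketch of an alternative is to apply the same boolean mask to the frame directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

m = ~np.isnan(df['Col2'].values)

# Indexing the frame with the mask preserves Col1 as an integer column,
# whereas df.values[m] would upcast everything to float.
df2 = df[m]
print(df2.dtypes)
```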
Timing
Bigger Data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice([np.nan, 1], size=(10000, 10))).add_prefix('Col')
%%timeit
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
1000 loops, best of 3: 326 µs per loop
%timeit df.query("Col2 == Col2")
1000 loops, best of 3: 1.48 ms per loop
%timeit df.loc[df.Col2.notnull()]
1000 loops, best of 3: 417 µs per loop
%timeit df[~df['Col2'].isnull()]
1000 loops, best of 3: 385 µs per loop
%timeit df.dropna(subset=['Col2'])
1000 loops, best of 3: 913 µs per loop
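dropna also accepts several columns in subset, and its how parameter controls whether a row is dropped when any of those columns is NaN (the default) or only when all of them are. As a sketch on the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

# how='any' (the default): drop rows with NaN in either column
print(df.dropna(subset=['Col2', 'Col3']))

# how='all': drop a row only when both columns are NaN
print(df.dropna(subset=['Col2', 'Col3'], how='all'))
```

With this frame, the first call keeps only the middle row, while the second keeps all three, since no row is NaN in both Col2 and Col3.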
If you want to count and graph the number of NaNs before dropping your column(s):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
cols = df.columns
nans = [df[col].isna().sum() for col in cols]
sns.set(font_scale=1.1)
ax = sns.barplot(x=cols, y=nans, palette='hls')
ax.set(xlabel='Feature', ylabel='Number of NaNs', title='Number of NaNs per feature')
for p, uniq in zip(ax.patches, nans):
height = p.get_height()
ax.text(p.get_x()+p.get_width()/2.,
height + 10,
uniq,
ha="center")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
The simple implementation below follows on from the above, but shows how to filter out NaN rows in a specific column in place, and, for large data frames, how to count rows with NaN by column name (before and after).
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,np.nan,'A100'],[4,5,'A213'],[7,8,np.nan],[10,np.nan,'GA23']])
df.columns = ['areaCode','Distance','accountCode']
dataframe
areaCode Distance accountCode
1 NaN A100
4 5.0 A213
7 8.0 NaN
10 NaN GA23
Before: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 2
accountCode 1
dtype: int64
remove unwanted rows in-place:
df.dropna(subset=['Distance'], inplace=True)
After: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 0
accountCode 1
dtype: int64
dataframe:
areaCode Distance accountCode
4 5.0 A213
7 8.0 NaN