Python – Drop row if two columns are NaN
Question:
This is an extension to this question, where OP wanted to know how to drop rows where the values in a single column are NaN.
I’m wondering how I can drop rows where the values in 2 (or more) columns are both NaN. Using the second answer’s created Data Frame:
In [1]: df = pd.DataFrame(np.random.randn(10,3))
In [2]: df.ix[::2,0] = np.nan; df.ix[::4,1] = np.nan; df.ix[::3,2] = np.nan;
In [3]: df
Out[3]:
0 1 2
0 NaN NaN NaN
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
If I use the drop.na()
command, specifically the drop.na(subset=[1,2])
, then it completes an “or” type drop and leaves:
In[4]: df.dropna(subset=[1,2])
Out[4]:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
5 -1.250970 0.030561 -2.678622
7 0.049896 -0.308003 0.823295
What I want is an “and” type drop, where it drops rows where there is an NaN
in column index 1 and 2. This would leave:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
where only the first row is dropped.
Any ideas?
EDIT: changed data frame values for consistency
Answers:
Any one of the following two:
df.dropna(subset=[1, 2], how='all')
or
df.dropna(subset=[1, 2], thresh=1)
Specify dropna()
method:
df.dropna(subset=[1,2], how='all')
The accepted answer didn’t work for me. I tried the following code and nothing changed in my dataframe.
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], how='all', inplace=True)
EDIT: I think it didn’t work because ‘all’ refers to all columns in the df
, and not all columns in the subset
.
The thresh
is also a bit confusing for me, because it’s a condition to keep rows, even though this a dropping function. In the OP, if I understand correctly, it’s required to have at least one non nan value to keep a row, and if a row has more than 2 nan values, it should be dropped.
In my case, I want to drop a row if all values in the columns of the subset
have nan values, so I want to keep a row if all 6 values are not nan, and I don’t know if this corresponds to thresh=6
.
These two commands also didn’t work for me:
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], thres=6, inplace=True)
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], how='all', thres=6, inplace=True)
and when I added axis=1
or axis='columns'
in the parameters, I got a KeyError
that the list of columns I am passing are not in the dataframe, and I don’t know the reason why.
I ended up iterating over the df
like this, also because I realized that some supposedly NaN values, might not be caught, because they are Null or empty:
indices_to_drop = list()
for index, row in df.iterrows():
if (pd.isnull(row["col1"]) or pd.isna(row["col1"]) or row["col1"] == "") and
(pd.isnull(row["col2"]) or pd.isna(row["col2"]) or row["col2"] == "") and
(pd.isnull(row["col3"]) or pd.isna(row["col3"]) or row["col3"]
== "") and
(pd.isnull(row["col4"]) or pd.isna(row["col4"]) or row["col4"] ==
"")
and
(pd.isnull(row["col5"]) or pd.isna(row["col5"]) or row[
"col6"] == "") and
(pd.isnull(row["col6"]) or pd.isna(row["col6"]) or row["col6"] == ""):
indices_to_drop.append(index)
df.drop(labels=indices_to_drop, inplace=True)
I faced a similar issue where I’d 45 features(columns) and wanted to drop rows for only selected features having NaN
values eg columns 7 to 45.
Step 1: I created a list (col_lst
) from columns which I wanted to be operated for NaN
Step 2: df.dropna(axis = 0, subset = col_lst, how = 'all', inplace = True)
The above step removed only those rows fromthe dataframe which had all (not any) the columns from 7 to 45 with NaN
values.
Try this code below, I hope it can solve your problem
drop_rows = df.dropna( how='any',
subset=['Columns1', 'Columns2'])
This is an extension to this question, where OP wanted to know how to drop rows where the values in a single column are NaN.
I’m wondering how I can drop rows where the values in 2 (or more) columns are both NaN. Using the second answer’s created Data Frame:
In [1]: df = pd.DataFrame(np.random.randn(10,3))
In [2]: df.ix[::2,0] = np.nan; df.ix[::4,1] = np.nan; df.ix[::3,2] = np.nan;
In [3]: df
Out[3]:
0 1 2
0 NaN NaN NaN
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
If I use the drop.na()
command, specifically the drop.na(subset=[1,2])
, then it completes an “or” type drop and leaves:
In[4]: df.dropna(subset=[1,2])
Out[4]:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
5 -1.250970 0.030561 -2.678622
7 0.049896 -0.308003 0.823295
What I want is an “and” type drop, where it drops rows where there is an NaN
in column index 1 and 2. This would leave:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
where only the first row is dropped.
Any ideas?
EDIT: changed data frame values for consistency
Any one of the following two:
df.dropna(subset=[1, 2], how='all')
or
df.dropna(subset=[1, 2], thresh=1)
Specify dropna()
method:
df.dropna(subset=[1,2], how='all')
The accepted answer didn’t work for me. I tried the following code and nothing changed in my dataframe.
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], how='all', inplace=True)
EDIT: I think it didn’t work because ‘all’ refers to all columns in the df
, and not all columns in the subset
.
The thresh
is also a bit confusing for me, because it’s a condition to keep rows, even though this a dropping function. In the OP, if I understand correctly, it’s required to have at least one non nan value to keep a row, and if a row has more than 2 nan values, it should be dropped.
In my case, I want to drop a row if all values in the columns of the subset
have nan values, so I want to keep a row if all 6 values are not nan, and I don’t know if this corresponds to thresh=6
.
These two commands also didn’t work for me:
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], thres=6, inplace=True)
df.dropna(subset=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], how='all', thres=6, inplace=True)
and when I added axis=1
or axis='columns'
in the parameters, I got a KeyError
that the list of columns I am passing are not in the dataframe, and I don’t know the reason why.
I ended up iterating over the df
like this, also because I realized that some supposedly NaN values, might not be caught, because they are Null or empty:
indices_to_drop = list()
for index, row in df.iterrows():
if (pd.isnull(row["col1"]) or pd.isna(row["col1"]) or row["col1"] == "") and
(pd.isnull(row["col2"]) or pd.isna(row["col2"]) or row["col2"] == "") and
(pd.isnull(row["col3"]) or pd.isna(row["col3"]) or row["col3"]
== "") and
(pd.isnull(row["col4"]) or pd.isna(row["col4"]) or row["col4"] ==
"")
and
(pd.isnull(row["col5"]) or pd.isna(row["col5"]) or row[
"col6"] == "") and
(pd.isnull(row["col6"]) or pd.isna(row["col6"]) or row["col6"] == ""):
indices_to_drop.append(index)
df.drop(labels=indices_to_drop, inplace=True)
I faced a similar issue where I’d 45 features(columns) and wanted to drop rows for only selected features having NaN
values eg columns 7 to 45.
Step 1: I created a list (col_lst
) from columns which I wanted to be operated for NaN
Step 2: df.dropna(axis = 0, subset = col_lst, how = 'all', inplace = True)
The above step removed only those rows fromthe dataframe which had all (not any) the columns from 7 to 45 with NaN
values.
Try this code below, I hope it can solve your problem
drop_rows = df.dropna( how='any',
subset=['Columns1', 'Columns2'])