Remove duplicate rows but with condition

Question

I have a data frame that looks something like:

df =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

So basically I have the first two rows being identical, the next two rows also being identical, and the last two different.

What I would like to do is to remove duplicate rows, but only those rows where col1 and col2 equals 0, i.e. the resulting data frame should be:

df_final =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

Is there any way to accomplish this in an easy way ? I know I could probably do some kind of sorting on the data frame, and then loop through each row and check for conditions. I just suspect that could be a rather time consuming process if there are a lot of rows.

Asked By: Denver Dang

||

Source

Answer 1

You can filter in boolean indexing by chain 2 conditions – find rows without 0 rows and then add first duplicated rows by DataFrame.duplicated, by default are tested all columns, for chain is used | for bitwise OR:

df_final = df[df[['col1','col2']].ne(0).all(axis=1) | ~df.duplicated()]
print (df_final)
         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Details:

print (df[['col1','col2']].ne(0).all(axis=1))
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

print (~df.duplicated())
0     True
1    False
2     True
3    False
4     True
5     True
dtype: bool

Answered By: jezrael

Answer 2

IIUC, this is a simple selection by boolean masks using duplicated to find the duplicated rows and ne+all to filter the 0 values:

# is the row not a duplicate?
mask1 = ~df.duplicated()
# are col1 and col2 not both 0?
mask2 = df[['col1', 'col2']].ne(0).all(axis=1)
# then keep the data on either of the above conditions
df2 = df[mask1|mask2]

output:

         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Alternative with the reverse operations (eq+all)

# Is the row duplicated?
m1 = df.duplicated()
# Are both cols equal to 0?
m2 = df[['col1', 'col2']].eq(0).all(1)
# then keep if not both conditions are met
df[~(m1&m2)]

Answered By: mozway

Answer 3

df1.drop(index=df1.query("col1==0 and col2==0")
         .duplicated().loc[lambda x:x].index)

      date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Answered By: G.G

Remove duplicate rows but with condition

Question:

Answers: