Pandas drop duplicate rows INCLUDING index

Question:

I know how to drop duplicate rows based on column data. I also know how to drop duplicate rows based on the row index. My question is: is there a way to drop duplicate rows based on the index and one column?

Thanks!

Asked By: Bimons


Answers:

This can be done by turning the index into a column.

Below is a sample data set (fyi, I think someone downvoted your question because it didn’t include a sample data set):

import pandas as pd
df = pd.DataFrame({'a': [1,2,2,3,4,4,5], 'b': [2,2,2,3,4,5,5]}, index=[0,1,1,2,3,5,5])

Output:

   a  b
0  1  2
1  2  2
1  2  2
2  3  3
3  4  4
5  4  5
5  5  5

Then you can use the following line. reset_index() moves the index into a new column, which is named index when the original index has no name. drop_duplicates() then removes rows that repeat on both that new column and the other column (b in this case). Finally, set_index('index') restores the original index values:

df.reset_index().drop_duplicates(subset=['index','b']).set_index('index')

Output:

       a  b
index      
0      1  2
1      2  2
2      3  3
3      4  4
5      4  5
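
The one-liner above relies on reset_index() naming the new column index, which only happens when the index has no name, and it leaves the result's index labelled index. Below is a minimal sketch of a variant that avoids both, assuming pandas is installed. The helper name drop_dupes_on_index_and and the temporary column name __row_index__ are made up for illustration:

```python
import pandas as pd


def drop_dupes_on_index_and(df, column):
    """Drop rows whose (index, column) pair repeats an earlier row."""
    # Hypothetical temporary column name; assumed not to clash with real columns.
    tmp = "__row_index__"
    out = df.copy()
    out[tmp] = out.index  # copy the index values into a regular column
    out = out.drop_duplicates(subset=[tmp, column])
    return out.drop(columns=tmp)  # the original index (and its name) is untouched


df = pd.DataFrame(
    {"a": [1, 2, 2, 3, 4, 4, 5], "b": [2, 2, 2, 3, 4, 5, 5]},
    index=[0, 1, 1, 2, 3, 5, 5],
)
result = drop_dupes_on_index_and(df, "b")
print(result)
```

For the sample data this should keep the same five rows as the one-liner, but without renaming the index.
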
Answered By: JJ101

If you don’t want to reset and then re-set your index as in JJ101’s answer, you can make use of pandas’ .duplicated() method instead of .drop_duplicates().

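If you care about duplicates in the index and some column b, you can flag them with df.index.duplicated() and df.duplicated(subset="b"), respectively. Each returns a boolean mask that is True for every repeat after the first occurrence. Note that a row is flagged by both masks whenever its index value and its b value have each appeared earlier, even in different rows, so this matches a check on the (index, b) pair only when that cannot happen, as in the sample data above. Combine the masks with the & operator, negate the intersection with ~, and you get something like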

clean_df = df[~(df.index.duplicated() & df.duplicated(subset="b"))]
print(clean_df)

Output:

   a  b
0  1  2
1  2  2
2  3  3
3  4  4
5  4  5
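
One caveat with combining two separate masks: a row is flagged as soon as its index value and its b value have each appeared earlier, even if they appeared in different rows. Here is a small sketch with made-up data (not the sample above) that shows the difference, along with one way to test the (index, b) pairs together, using pd.MultiIndex.from_arrays(...).duplicated():

```python
import pandas as pd

# Hypothetical data: the last row repeats an earlier index value (1) and an
# earlier b value (2), but the pair (1, 2) itself has not appeared before.
df = pd.DataFrame({"a": [10, 20, 30], "b": [2, 3, 2]}, index=[0, 1, 1])

# Combining the two separate masks with & flags the last row ...
separate = df.index.duplicated() & df.duplicated(subset="b").to_numpy()

# ... while checking the (index, b) pairs together does not.
pairs = pd.MultiIndex.from_arrays([df.index, df["b"]])
paired = pairs.duplicated()

print(separate.tolist())  # [False, False, True]
print(paired.tolist())    # [False, False, False]

# Boolean NumPy masks select rows by position, so duplicate labels are fine.
clean_df = df[~paired]
print(clean_df)
```

If the index and column values can repeat independently like this, the pair-based mask is the safer choice. On the sample data above both masks give the same result.
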
Answered By: L0tad