Dropping duplicate rows in a Pandas DataFrame based on multiple column values

Question:

In a DataFrame I need to drop/filter out duplicate rows based on the combined values of columns A and B. In the example DataFrame

    A   B   C   D
0   1   1   3   9
1   1   2   4   8
2   1   3   5   7
3   1   3   4   6
4   1   4   5   5
5   1   4   6   4

rows 2 and 3, and rows 4 and 5, are duplicates (they share the same A and B values), and one of each pair should be dropped, keeping the row with the lowest value of

2 * C + 3 * D

To do this, I created a new temporary score column, S

df['S'] = 2 * df['C'] + 3 * df['D']

and finally selected, for each (A, B) group, the row at the index of the minimum S, dropping the temporary column afterwards

df = df.loc[df.groupby(['A', 'B'])['S'].idxmin()]
del df['S']

The result is

    A   B   C   D
0   1   1   3   9
1   1   2   4   8
3   1   3   4   6
5   1   4   6   4

However, is there a more efficient way of doing this, without having to add (and later drop) a new column?

Asked By: leofer


Answers:

df.drop_duplicates(subset=['A', 'B'], inplace=True)

Answered By: Sajil Alakkalakath
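Note that drop_duplicates alone keeps the first occurrence of each (A, B) pair regardless of the score, so on the question's data it retains rows 2 and 4 rather than the lowest-score rows 3 and 5. A quick check (sample data reconstructed from the question):

```python
import pandas as pd

# DataFrame from the question
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 3, 4, 4],
                   'C': [3, 4, 5, 4, 5, 6],
                   'D': [9, 8, 7, 6, 5, 4]})

# Keeps the *first* row of each (A, B) pair, not the one with the lowest score
deduped = df.drop_duplicates(subset=['A', 'B'])
print(deduped.index.tolist())  # [0, 1, 2, 4] -- rows 3 and 5 were wanted instead
```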

Efficient solution for larger dataframes

import numpy as np

# Order row positions by ascending score, then keep the first
# (lowest-score) row seen for each (A, B) pair
ix = np.argsort(2 * df['C'] + 3 * df['D'])
result = df.iloc[ix].drop_duplicates(['A', 'B'])
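As a self-contained sketch on the question's data (the trailing sort_index only restores the original row order and is optional):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 3, 4, 4],
                   'C': [3, 4, 5, 4, 5, 6],
                   'D': [9, 8, 7, 6, 5, 4]})

# Positions of rows ordered by ascending score 2*C + 3*D
ix = np.argsort(2 * df['C'] + 3 * df['D'])

# After reordering, the first row of each (A, B) pair is the cheapest one,
# and drop_duplicates keeps exactly that first row
result = df.iloc[ix].drop_duplicates(['A', 'B']).sort_index()
print(result.index.tolist())  # [0, 1, 3, 5]
```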

timeit analysis

Setup:

60000 rows with 10000 different groups

Results

%%timeit my solution
# 6.75 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit OP solution
# 595 ms ± 50.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
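The exact benchmark setup is not shown; the following is a possible reconstruction (random integers giving roughly 10,000 distinct (A, B) groups over 60,000 rows is an assumption), which also checks that both approaches pick rows with the same minimal score per group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 60_000
# ~100 x 100 = 10_000 possible (A, B) groups (assumed setup)
df = pd.DataFrame({'A': rng.integers(0, 100, n),
                   'B': rng.integers(0, 100, n),
                   'C': rng.integers(0, 10, n),
                   'D': rng.integers(0, 10, n)})

# OP's approach: temporary score column + groupby/idxmin
tmp = df.assign(S=2 * df['C'] + 3 * df['D'])
op = tmp.loc[tmp.groupby(['A', 'B'])['S'].idxmin()].drop(columns='S')

# This answer's approach: argsort + drop_duplicates
ix = np.argsort(2 * df['C'] + 3 * df['D'])
fast = df.iloc[ix].drop_duplicates(['A', 'B'])

# Both keep one row per group with the minimal score (score ties may pick
# different rows, so compare per-group scores rather than exact rows)
def min_scores(res):
    return (2 * res['C'] + 3 * res['D']).groupby([res['A'], res['B']]).min()

assert len(op) == len(fast)
assert min_scores(op).equals(min_scores(fast))
```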
Answered By: Shubham Sharma

Here is a way using sort_values() and duplicated():

df.loc[~df.sort_values('A',key = lambda x: df['C'].mul(2).add(df['D'].mul(3))).duplicated(subset = ['A','B'])]

Output:

   A  B  C  D
0  1  1  3  9
1  1  2  4  8
3  1  3  4  6
5  1  4  6  4
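As a runnable sketch: the key= callable receives the 'A' column but returns the external score series (any same-length series is accepted), so the frame is effectively sorted by 2*C + 3*D before duplicated() marks every row after the first of each (A, B) pair:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 3, 4, 4],
                   'C': [3, 4, 5, 4, 5, 6],
                   'D': [9, 8, 7, 6, 5, 4]})

# Sort by the score (the key ignores its argument and returns the score
# series), then flag all but the first row of each (A, B) pair
mask = (df.sort_values('A', key=lambda x: df['C'].mul(2).add(df['D'].mul(3)))
          .duplicated(subset=['A', 'B']))

# Boolean masks align by index label, so the result comes back in the
# original row order
out = df.loc[~mask]
print(out.index.tolist())  # [0, 1, 3, 5]
```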
Answered By: rhug123