Dropping duplicate rows in a Pandas DataFrame based on multiple column values
Question:
In a dataframe I need to drop/filter out duplicate rows based on the combined columns A and B. In the example DataFrame
A B C D
0 1 1 3 9
1 1 2 4 8
2 1 3 5 7
3 1 3 4 6
4 1 4 5 5
5 1 4 6 4
rows 2 and 3, and rows 4 and 5, are duplicates, and one of each pair should be dropped, keeping the row with the lowest value of
2 * C + 3 * D
To do this, I created a new temporary score column, S
df['S'] = 2 * df['C'] + 3 * df['D']
and finally selected the rows at the per-group index of the minimum S, dropping the helper column afterwards
df = df.loc[df.groupby(['A', 'B'])['S'].idxmin()]
del df['S']
The result is
A B C D
0 1 1 3 9
1 1 2 4 8
3 1 3 4 6
5 1 4 6 4
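Putting the steps above together as a self-contained script (reconstructing the example frame from the question):

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1, 1],
    'B': [1, 2, 3, 3, 4, 4],
    'C': [3, 4, 5, 4, 5, 6],
    'D': [9, 8, 7, 6, 5, 4],
})

# Temporary score column, then keep the row with the minimal score
# in each (A, B) group via the index returned by idxmin
df['S'] = 2 * df['C'] + 3 * df['D']
result = df.loc[df.groupby(['A', 'B'])['S'].idxmin()].drop(columns='S')
del df['S']  # remove the temporary column from the original frame too

print(result)
```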
However, is there a more efficient way of doing this, without having to add (and later drop) a new column?
Answers:
df.drop_duplicates(subset=['A', 'B'], inplace=True)
Note that drop_duplicates keeps the first occurrence of each (A, B) pair, so on its own this only answers the question if the rows are already ordered by score.
Efficient solution for larger dataframes
import numpy as np

ix = np.argsort(2 * df['C'] + 3 * df['D'])
result = df.iloc[ix].drop_duplicates(['A', 'B'])
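The idea: reorder the rows by the score first, so that drop_duplicates (which keeps the first occurrence) retains the lowest-score row of each group. A runnable sketch on the question's example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1, 1],
    'B': [1, 2, 3, 3, 4, 4],
    'C': [3, 4, 5, 4, 5, 6],
    'D': [9, 8, 7, 6, 5, 4],
})

# Positions of the rows in ascending score order
ix = np.argsort(2 * df['C'] + 3 * df['D'])

# After reordering, the first row seen per (A, B) is the cheapest one
result = df.iloc[ix].drop_duplicates(['A', 'B'])
```

The rows come out in score order rather than original order; append `.sort_index()` if the original order matters.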
timeit analysis
Setup: 60000 rows with 10000 different groups
Results:
%%timeit my solution
# 6.75 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit OP solution
# 595 ms ± 50.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
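The benchmark setup itself is not shown; a sketch that approximates it (the column value ranges and group sizes here are my own assumptions, chosen so that roughly 10000 distinct (A, B) groups appear), comparing both routes for correctness:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: 60000 rows over at most 100 * 100 = 10000 (A, B) groups
rng = np.random.default_rng(0)
n = 60_000
df = pd.DataFrame({
    'A': rng.integers(0, 100, n),
    'B': rng.integers(0, 100, n),
    'C': rng.integers(0, 10, n),
    'D': rng.integers(0, 10, n),
})

# argsort + drop_duplicates route
ix = np.argsort(2 * df['C'] + 3 * df['D'])
fast = df.iloc[ix].drop_duplicates(['A', 'B']).sort_index()

# groupby + idxmin route (no temporary column, grouping the score directly)
s = 2 * df['C'] + 3 * df['D']
slow = df.loc[s.groupby([df['A'], df['B']]).idxmin()].sort_index()
```

Wrapping each of the two blocks in %%timeit reproduces the comparison; both keep one minimal-score row per group, though tied rows may resolve differently.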
Here is a way using sort_values()
and duplicated()
df.loc[~df.sort_values('A', key=lambda x: df['C'].mul(2).add(df['D'].mul(3))).duplicated(subset=['A', 'B'])]
Output:
A B C D
0 1 1 3 9
1 1 2 4 8
3 1 3 4 6
5 1 4 6 4
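The trick is that the key callable (supported since pandas 1.1) ignores the column it is handed and returns the full score series, which pandas aligns by index, so the frame is effectively sorted by score; duplicated() then flags every later (A, B) occurrence, and the boolean mask is realigned to the original row order by .loc. A self-contained version:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1, 1],
    'B': [1, 2, 3, 3, 4, 4],
    'C': [3, 4, 5, 4, 5, 6],
    'D': [9, 8, 7, 6, 5, 4],
})

# Sort by the 2*C + 3*D score via the key callable, mark all but the
# first (cheapest) row of each (A, B) pair, and keep the unmarked rows
score = lambda x: df['C'].mul(2).add(df['D'].mul(3))
result = df.loc[~df.sort_values('A', key=score).duplicated(subset=['A', 'B'])]
```

Because boolean indexing with .loc aligns on the index, the result keeps the original row order.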