Fastest Way to Drop Duplicated Index in a Pandas DataFrame
Question:
If I want to drop rows with a duplicated index from a DataFrame, the following doesn’t work, for obvious reasons:
myDF.drop_duplicates(cols=index)
and
myDF.drop_duplicates(cols='index')
looks for a column named ‘index’.
To drop rows with a duplicated index I have to do:
myDF['index'] = myDF.index
myDF = myDF.drop_duplicates(cols='index')
myDF.index = myDF['index']
myDF = myDF.drop('index', axis=1)
Is there a more efficient way?
Answers:
You can use numpy.unique with return_index=True to obtain the position of the first occurrence of each unique index label, and then use iloc to select the rows at those positions:
>>> df
        val
A  0.021372
B  1.229482
D -1.571025
D -0.110083
C  0.547076
B -0.824754
A -1.378705
B -0.234095
C -1.559653
B -0.531421
[10 rows x 1 columns]
>>> idx = np.unique(df.index, return_index=True)[1]
>>> df.iloc[idx]
        val
A  0.021372
B  1.229482
C  0.547076
D -1.571025
[4 rows x 1 columns]
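As a self-contained sketch of this approach, the snippet below rebuilds the frame from the printed output (values copied from it; the variable names are illustrative) and applies numpy.unique. One thing to be aware of: the positions come back in the sorted order of the labels, so the result is sorted by index. Sorting the positions first, as in the last line, is one way to keep the original row order instead.

```python
import numpy as np
import pandas as pd

# Same data as the printed example above
df = pd.DataFrame(
    {"val": [0.021372, 1.229482, -1.571025, -0.110083, 0.547076,
             -0.824754, -1.378705, -0.234095, -1.559653, -0.531421]},
    index=list("ABDDCBABCB"),
)

# Positions of the first occurrence of each unique label,
# ordered by the sorted labels (A, B, C, D)
_, idx = np.unique(df.index, return_index=True)

# Rows come back sorted by label
sorted_by_label = df.iloc[idx]

# Sorting the positions keeps the rows in their original order
original_order = df.iloc[np.sort(idx)]
```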
Simply group by the index and take the first entry of each group: DF.groupby(DF.index).first()
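A caveat worth illustrating with the groupby approach: first() takes the first non-null value in each column, so when the data contains NaN it can stitch together values from different rows. The sketch below uses a small made-up frame (the column names x and y are hypothetical) and contrasts it with head(1), which returns the first row of each group unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: label "A" is duplicated and its first row has a NaN
df = pd.DataFrame(
    {"x": [np.nan, 1.0, 2.0], "y": [10.0, 20.0, 30.0]},
    index=["A", "A", "B"],
)

# first() picks the first non-null value per column:
# for "A" this gives x=1.0 (from the second row) and y=10.0 (from the first)
first_non_null = df.groupby(df.index).first()

# head(1) keeps the first row of each group as it is, NaN included
first_row = df.groupby(df.index).head(1)
```

If the frame has no missing values the two give the same rows, although groupby sorts the group labels by default while head(1) preserves the original row order.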
The ‘duplicated’ method is available on an Index as well as on DataFrames and Series. Just select the rows that aren’t marked as having a duplicate index:
df[~df.index.duplicated()]
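The mask-based approach can be sketched end to end as follows, reusing the sample values from the example earlier in the thread. The keep argument of Index.duplicated controls which occurrence survives; the variable names here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"val": [0.021372, 1.229482, -1.571025, -0.110083, 0.547076,
             -0.824754, -1.378705, -0.234095, -1.559653, -0.531421]},
    index=list("ABDDCBABCB"),
)

# Default: mark every repeat after the first occurrence, so the first is kept
keep_first = df[~df.index.duplicated(keep="first")]

# Keep the last occurrence of each label instead
keep_last = df[~df.index.duplicated(keep="last")]

# keep=False marks every row whose label appears more than once,
# so only labels that occur exactly once survive
only_unique = df[~df.index.duplicated(keep=False)]
```

Unlike the numpy.unique and groupby variants, boolean masking leaves the surviving rows in their original order. In this sample every label is repeated, so only_unique is empty.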