Pandas: Get duplicated indexes
Question:
Given a dataframe, I want to find the duplicated indexes whose rows do not have duplicate values in the columns, and see which values are different.
Specifically, I have this dataframe:
import pandas as pd

# data downloaded beforehand (shell):
# wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
And some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).
I’m guessing this will require some kind of groupby, and hopefully some groupby ninja can help me out.
To simplify even further, suppose we have only the index and the repeat type:
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
So in the output I’d like to see all duplicated indexes and their repeat types, like so:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
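A minimal sketch that builds this toy example as a DataFrame with a nonunique index (the column name 'type' is an assumption for illustration, matching the answers below):

import pandas as pd

df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2', 'genome_location3'])

print(df.index.is_unique)  # False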
Answers:
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
type
genome_location1 MIR3
genome_location1 AluJb
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster and a little more readable too.
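For illustration, a hedged sketch of both approaches side by side on the toy frame from the question; they select the same rows:

import pandas as pd

df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2', 'genome_location3'])

# filter: keep only the groups with more than one row
dup_filter = df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

# transform: build a per-row group-size mask, then index with it
dup_mask = df[df.groupby(level=0)['type'].transform(len) > 1]['type']

print(dup_filter.equals(dup_mask))  # True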
Important: The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue (and a related issue with transform on Series) was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
More succinctly:
df[df.groupby(level=0).type.count() > 1]
FYI, for a MultiIndex:
df[df.groupby(level=[0,1]).type.count() > 1]
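A hedged sketch of the MultiIndex case (the index levels here are invented for illustration). Using transform('count') in place of count() keeps the boolean mask row-aligned with df, which avoids the reindexing warning the count() form can trigger on a nonunique index:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('loc1', 'exonA'), ('loc1', 'exonA'), ('loc1', 'exonB')],
    names=['location', 'exon'])
df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1']}, index=idx)

# keep rows whose full (location, exon) key occurs more than once
print(df[df.groupby(level=[0, 1]).type.transform('count') > 1])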
Also useful and very succinct:
df[df.index.duplicated()]
Note that with the default keep='first' this marks only the second and later occurrences of each label, so to see all the duplicated rows you’ll want this:
df[df.index.duplicated(keep=False)]
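A quick sketch of the difference on the toy frame from the question:

import pandas as pd

df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2', 'genome_location3'])

print(df[df.index.duplicated()])            # only the second (AluJb) row
print(df[df.index.duplicated(keep=False)])  # both genome_location1 rows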
Even faster and better:
df.index.get_duplicates()
As of 9/21/18, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, and suggests the following instead:
df.index[df.index.duplicated()].unique()
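On the toy frame, this returns each duplicated label exactly once:

import pandas as pd

df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2', 'genome_location3'])

print(df.index[df.index.duplicated()].unique())
# Index(['genome_location1'], dtype='object')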
This prints the duplicated rows' index locations and returns the rows themselves as a preview. Note that df.duplicated() flags rows whose column values repeat; use df.index.duplicated() if you only care about repeated index labels.

def dup_rows_index(df):
    # rows that are exact value-duplicates of an earlier row
    dup = df[df.duplicated()]
    print('Duplicated index locs:', dup.index.tolist())
    return dup
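A usage sketch; since the question's toy frame has no value-duplicates, a fully duplicated row is invented here for illustration:

import pandas as pd

def dup_rows_index(df):
    # rows that are exact value-duplicates of an earlier row
    dup = df[df.duplicated()]
    print('Duplicated index locs:', dup.index.tolist())
    return dup

df = pd.DataFrame({'type': ['MIR3', 'MIR3', 'Tigger1']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2'])

preview = dup_rows_index(df)  # prints: Duplicated index locs: ['genome_location1']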