Check duplicated indices for each subset of values in pandas dataframe

Question:

I have the following dataframe:

import pandas as pd

df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'], 
                             ['AP2', 'House1'], 
                             ['AP3', 'House2'], 
                             ['AP4', 'House2'],
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])

I need to take each subset of rows that share a value in a column and check whether that subset contains duplicated indices. For example, in column House, the three House1 rows have indices 0, 1 and 2, so nothing is duplicated. The three House2 rows have indices 0, 1 and 1, so index 1 is duplicated once.

I have tried this:

print(f'{df_test.index.duplicated().sum()} repeated entries')

But this reports 3 repeated entries, because it looks at the index as a whole and does not consider each value of the column separately.

Asked By: Murilo

||

Answers:

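A possible solution is to move the index into a regular column with reset_index() and then count the rows that repeat an (index, column value) pair: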

print(df_test.reset_index().duplicated(['index', 'AP']).sum())
print(df_test.reset_index().duplicated(['index', 'House']).sum())

Output:

0
1
Answered By: PaulS
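
If the rows themselves are needed rather than only the count, the same idea can be extended. The following is a minimal sketch, not part of the answer above: it assumes the df_test from the question, and the names flat, mask and offending are made up for illustration. Passing keep=False flags every row of a repeated (index, House) pair instead of only the later occurrences.

```python
import pandas as pd

df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'],
                             ['AP2', 'House1'],
                             ['AP3', 'House2'],
                             ['AP4', 'House2'],
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])

# An unnamed index becomes a column called 'index' after reset_index()
flat = df_test.reset_index()

# keep=False marks all members of a duplicated (index, House) pair
mask = flat.duplicated(['index', 'House'], keep=False)

# Rows whose index label is repeated within the same House
offending = flat[mask]
print(offending)
```

Here the two House2 rows with index 1 (AP4 and AP5) are the ones selected.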

You can turn the index into a column, group by House, and count the duplicated index values within each group:

>>> (df_test.reset_index(names='Dups')
            .groupby('House', as_index=False)['Dups']
            .agg(lambda x: x.duplicated().sum()))

    House  Dups
0  House1     0
1  House2     1
Answered By: Corralien
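
One caveat: the names argument of reset_index was added in pandas 1.5. On an older version, a possible workaround (a sketch under that assumption, not the answerer's code) is to name the index with rename_axis before resetting it. The variable name result is only illustrative.

```python
import pandas as pd

df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'],
                             ['AP2', 'House1'],
                             ['AP3', 'House2'],
                             ['AP4', 'House2'],
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])

# rename_axis names the index 'Dups', so reset_index() creates a 'Dups' column
result = (df_test.rename_axis('Dups')
                 .reset_index()
                 .groupby('House', as_index=False)['Dups']
                 .agg(lambda x: x.duplicated().sum()))
print(result)
```

This should give the same per-House counts as the snippet above: 0 for House1 and 1 for House2.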