In python/pandas is there a way to find a row that has a duplicate value in one column and a unique value in another?

Question:

For example:

Say I have a dataframe like

Date Registered  Name        Gift
2021-10-30       John Doe    Money
2021-10-30       John Doe    Food
2021-11-02       Tyler Blue  Gift Card
2021-11-02       Tyler Blue  Food
2021-12-01       John Doe    Supplies

I want to locate the indexes that keep one row for each unique combination of Name and Date Registered. Like so:

Date Registered  Name        Gift
2021-10-30       John Doe    Money
2021-11-02       Tyler Blue  Gift Card
2021-12-01       John Doe    Supplies

I tried this:

name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep= 'last')
def extract_name(TableName):
    return TableName.duplicated(subset=['Name']) 
extract_name(name_view)

But this does not get rid of all the indexes with duplicate dates. Any suggestions? I’m fine with it simply returning a list of the indexes as well, it isn’t required to output the full row.

Asked By: thorscode


Answers:

This will give the requested output as a list of index values:

print(df.reset_index().groupby(['Date Registered','Name']).first()['index'].tolist())

Output:

[0, 2, 4]
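
To run this end to end, here is a minimal sketch that rebuilds the example frame from the question. The values are copied from the question; the variable name first_idx is only illustrative. Note that groupby sorts by the group keys, so the indexes come back in key order rather than original row order; for this data the two happen to coincide.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Date Registered": ["2021-10-30", "2021-10-30", "2021-11-02",
                        "2021-11-02", "2021-12-01"],
    "Name": ["John Doe", "John Doe", "Tyler Blue", "Tyler Blue", "John Doe"],
    "Gift": ["Money", "Food", "Gift Card", "Food", "Supplies"],
})

# reset_index() turns the default row labels into an "index" column,
# so the first label of each (date, name) group can be picked out.
first_idx = (
    df.reset_index()
      .groupby(["Date Registered", "Name"])["index"]
      .first()
      .tolist()
)
print(first_idx)
```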
Answered By: constantstranger

You were already there with drop_duplicates(). It is a DataFrame method, so call it on df and pass both columns in subset:

>>> df.drop_duplicates(subset=['Date Registered', 'Name'])
  Date Registered        Name       Gift
0      2021-10-30    John Doe      Money
2      2021-11-02  Tyler Blue  Gift Card
4      2021-12-01    John Doe   Supplies

The indices are therefore:

>>> df.drop_duplicates(subset=['Date Registered', 'Name']).index
Int64Index([0, 2, 4], dtype='int64')
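
If a plain Python list or a boolean mask is more convenient than an Index object, something like the following sketch should work. It rebuilds the question's example frame; mask, kept and last_kept are illustrative names, not anything from the original code.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Date Registered": ["2021-10-30", "2021-10-30", "2021-11-02",
                        "2021-11-02", "2021-12-01"],
    "Name": ["John Doe", "John Doe", "Tyler Blue", "Tyler Blue", "John Doe"],
    "Gift": ["Money", "Food", "Gift Card", "Food", "Supplies"],
})

cols = ["Date Registered", "Name"]

# True for every row that repeats an earlier (date, name) pair
mask = df.duplicated(subset=cols)

# Index labels of the rows that are kept, as a plain list
kept = df.index[~mask].tolist()
print(kept)

# keep="last" retains the final row of each pair instead of the first
last_kept = df.drop_duplicates(subset=cols, keep="last").index.tolist()
print(last_kept)
```

The default keep="first" matches the output requested in the question; keep="last" is shown only because the original attempt used it.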
Answered By: T C Molenaar

Adding .reset_index() to the end of your "name_view" assignment removed all the excess rows when I ran it against your example data. (I had to change the name of the first column to make it work.)

name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep= 'last').reset_index()


def extract_name(TableName):
    return TableName.duplicated(subset=['Name']) 
extract_name(name_view)
Answered By: Omnishroom

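Just use df.drop_duplicates() on what you already have, and assign the result, because it returns a new DataFrame rather than modifying df:

import pandas as pd

# csv_file is the path to your CSV file
df = pd.read_csv(csv_file, encoding='latin-1')

# Keep the first row for each (Date Registered, Name) pair
df = df.drop_duplicates(subset=['Date Registered', 'Name'])

print(df)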
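
One thing to keep in mind with this approach: drop_duplicates() returns a new DataFrame and leaves the original untouched unless the result is assigned (or inplace=True is passed). The sketch below illustrates that, using an in-memory CSV in place of a real file; csv_text and the other variable names are assumptions made for the example.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file, with the data from the question
csv_text = """Date Registered,Name,Gift
2021-10-30,John Doe,Money
2021-10-30,John Doe,Food
2021-11-02,Tyler Blue,Gift Card
2021-11-02,Tyler Blue,Food
2021-12-01,John Doe,Supplies
"""
df = pd.read_csv(io.StringIO(csv_text))

# Calling the method without assigning the result discards it
df.drop_duplicates(subset=["Date Registered", "Name"])
rows_before = len(df)

# Assigning the result keeps the de-duplicated frame
deduped = df.drop_duplicates(subset=["Date Registered", "Name"])
rows_after = len(deduped)

print(rows_before, rows_after)
```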
Answered By: feelsgood