In Python/pandas, is there a way to find rows that have a duplicate value in one column and a unique value in another?
Question:
For example:
Say I have a dataframe like
Date Registered | Name | Gift |
---|---|---|
2021-10-30 | John Doe | Money |
2021-10-30 | John Doe | Food |
2021-11-02 | Tyler Blue | Gift Card |
2021-11-02 | Tyler Blue | Food |
2021-12-01 | John Doe | Supplies |
I want to locate all indexes where an entry in Name has a unique value in Date Registered, like so:
Date Registered | Name | Gift |
---|---|---|
2021-10-30 | John Doe | Money |
2021-11-02 | Tyler Blue | Gift Card |
2021-12-01 | John Doe | Supplies |
I tried this:
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last')
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
But this does not get rid of all the indexes with duplicate dates. Any suggestions? I'm fine with it simply returning a list of the indexes as well; it isn't required to output the full row.
Answers:
This will give the requested output as a list of index values:
print(df.reset_index().groupby(['Date Registered','Name']).first()['index'].tolist())
Output:
[0, 2, 4]
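As a self-contained check, the groupby approach can be sketched against the sample data from the question (the DataFrame construction below is an assumption; the question only shows the table):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# reset_index() turns the original index into an 'index' column, so
# .first() keeps the first original index seen for each (date, name) pair
idx = df.reset_index().groupby(['Date Registered', 'Name']).first()['index'].tolist()
print(idx)  # [0, 2, 4]
```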
You were already there with pd.drop_duplicates():
>>> df.drop_duplicates(subset=['Date Registered', 'Name'])
Date Registered Name Gift
0 2021-10-30 John Doe Money
2 2021-11-02 Tyler Blue Gift Card
4 2021-12-01 John Doe Supplies
The indices are therefore:
>>> df.drop_duplicates(subset=['Date Registered', 'Name']).index
Int64Index([0, 2, 4], dtype='int64')
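Put together as a minimal sketch (rebuilding the question's frame; note that pandas 2.x prints a plain `Index` rather than `Int64Index`):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# The first row of each (Date Registered, Name) pair survives,
# so its original index label is what remains
kept = df.drop_duplicates(subset=['Date Registered', 'Name']).index.tolist()
print(kept)  # [0, 2, 4]
```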
Adding .reset_index() to the end of your name_view expression removed all excess rows when I ran it against your example data (*I had to change the name of the first column to make it work).
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last').reset_index()
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
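For reference, a runnable version of that snippet, with the subset column changed to 'Date Registered' as the answer notes (a sketch; the sample frame is rebuilt from the question):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# keep='last' retains the last row of each (Name, date) pair;
# reset_index() then turns the surviving labels into an 'index' column
name_view = df.drop_duplicates(subset=['Name', 'Date Registered'],
                               keep='last').reset_index()

def extract_name(table):
    # True for rows whose Name already appeared earlier in table
    return table.duplicated(subset=['Name'])

print(extract_name(name_view).tolist())  # [False, False, True]
```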
Just use pd.drop_duplicates() on what you already have:
import pandas as pd

df = pd.read_csv(csv_file, encoding='latin-1')
df = df.drop_duplicates(subset=['Date Registered', 'Name'])
print(df)