In Python/pandas, is there a way to find rows that have a duplicate value in one column and a unique value in another?
Question:
For example:
Say I have a dataframe like
Date Registered | Name | Gift |
---|---|---|
2021-10-30 | John Doe | Money |
2021-10-30 | John Doe | Food |
2021-11-02 | Tyler Blue | Gift Card |
2021-11-02 | Tyler Blue | Food |
2021-12-01 | John Doe | Supplies |
I want to locate all indexes where an entry in Name has a unique value in Date Registered, like so:
Date Registered | Name | Gift |
---|---|---|
2021-10-30 | John Doe | Money |
2021-11-02 | Tyler Blue | Gift Card |
2021-12-01 | John Doe | Supplies |
I tried this:
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last')
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
But this does not get rid of all the indexes with duplicate dates. Any suggestions? I'm fine with it simply returning a list of the indexes as well; it isn't required to output the full row.
Answers:
This will give the requested output as a list of index values:
print(df.reset_index().groupby(['Date Registered','Name']).first()['index'].tolist())
Output:
[0, 2, 4]
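As a self-contained check, the groupby approach can be sketched against the sample data from the question (the DataFrame construction below is an assumption; the question only shows the table):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# reset_index() turns the original index into an 'index' column, so
# .first() keeps the first original index seen for each (date, name) pair
idx = df.reset_index().groupby(['Date Registered', 'Name']).first()['index'].tolist()
print(idx)  # [0, 2, 4]
```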
You were already there with pd.drop_duplicates():
>>> df.drop_duplicates(subset=['Date Registered', 'Name'])
Date Registered Name Gift
0 2021-10-30 John Doe Money
2 2021-11-02 Tyler Blue Gift Card
4 2021-12-01 John Doe Supplies
The indices are therefore:
>>> df.drop_duplicates(subset=['Date Registered', 'Name']).index
Int64Index([0, 2, 4], dtype='int64')
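Put together as a minimal sketch (rebuilding the question's frame; note that pandas 2.x prints a plain `Index` rather than `Int64Index`):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# The first row of each (Date Registered, Name) pair survives,
# so its original index label is what remains
kept = df.drop_duplicates(subset=['Date Registered', 'Name']).index.tolist()
print(kept)  # [0, 2, 4]
```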
Adding .reset_index() to the end of your name_view expression removed all excess rows when I ran it against your example data (*I had to change the name of the first column to make it work).
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last').reset_index()
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
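For reference, a runnable version of that snippet, with the subset column changed to 'Date Registered' as the answer notes (a sketch; the sample frame is rebuilt from the question):

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'Date Registered': ['2021-10-30', '2021-10-30', '2021-11-02',
                        '2021-11-02', '2021-12-01'],
    'Name': ['John Doe', 'John Doe', 'Tyler Blue', 'Tyler Blue', 'John Doe'],
    'Gift': ['Money', 'Food', 'Gift Card', 'Food', 'Supplies'],
})

# keep='last' retains the last row of each (Name, date) pair;
# reset_index() then turns the surviving labels into an 'index' column
name_view = df.drop_duplicates(subset=['Name', 'Date Registered'],
                               keep='last').reset_index()

def extract_name(table):
    # True for rows whose Name already appeared earlier in table
    return table.duplicated(subset=['Name'])

print(extract_name(name_view).tolist())  # [False, False, True]
```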
Just use pd.drop_duplicates() on what you already have:
import pandas as pd

df = pd.read_csv(csv_file, encoding='latin-1')
df = df.drop_duplicates(subset=['Date Registered', 'Name'])
print(df)