Remove duplicates based on multiple criteria

Question:

Remove duplicates based on multiple criteriaRemove duplicates based on multiple criteria

Asked By: youtube

||

Answers:

Find duplicate rows in a Dataframe based on all or selected columns.

# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated(keep = 'first')]
Answered By: Dev

Here’s my best effort attempt at answering your question. The idea is to combine separate dataframes to derive the desired answer.

# Generate Data
data_dict = {
    'name': ['joe', 'sally', 'sarah', 'mary', 'adam', 'sam', 'jill', 'jack', np.nan, np.nan],
    'ID': [123, 123, 342, 342, 342, 214, 214, 992, 379, 579],
    'verified': [213, np.nan, np.nan, 643, 932, np.nan, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(data_dict)

# First Part of the Answer
for idx, row in df.iterrows():
    if isinstance(row['name'], float):
        row = row.copy()
        df.at[idx, 'name'] = row['ID']
        
# Second Part of the Answer
a = df.dropna().drop_duplicates(subset=['ID'])
id_a = a['ID'].unique()
b = df[df['verified'].isna()].drop_duplicates(subset=['ID'])
c = b[b['ID'].apply(lambda x: x not in id_a)]

pd.concat([a,c])

First part of the answer addresses "If they don’t have a name value, just replace the NaN with the ID value." and the second part of the answer addresses "I want to take whichever one of those duplicates has a value in the ‘verified’ column. If there is multiple rows with a value in that column, just take the first row that has a verified value. If none of the duplicates have verified rows, then just keep the first duplicate. It is important that all of the rows have a name value."

Answered By: peanutsee

Identify the missing names and print the IDs:

print(f"Missing names: {', '.join(df.loc[df['name'].isna(), 'ID'].astype(str))}")
# Missing names: 379, 579

fillna the names, then get the first non-NA (or first NA if none) per group:

df2 = df.assign(name=df['name'].fillna('ID/'+df['ID'].astype(str)))

df2 = df2.loc[df2['verified'].notna().groupby(df['ID'], sort=False).idxmax()]

print(df2)

Output:

     name   ID  verified
0     joe  123     213.0
3    mary  342     643.0
5     sam  214       NaN
7    jack  992       NaN
8  ID/379  379       NaN
9  ID/579  579       NaN
Answered By: mozway

You can use sort_values with key = lambda x: x != '' to order the rows by whether the verified column has a value. Then giving the parameter keep = 'first' to df.duplicated will keep a row with a nonempty value, if any exist.

Answered By: Acccumulation
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.