Remove duplicate values across columns in pandas dataframe, without removing entire row

Question:

I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row.

Dataframe:

    A   B   C
0   foo g   A
1   foo g   G
2   yes y   B
3   bar y   B

Desired result:

    A   B   C
0   foo g   A
1   NaN NaN G
2   yes y   B
3   bar NaN NaN

I have tried drop_duplicates(), splitting the data into separate data frames by column and then appending them back together, but this caused its own problems.

I have also tried this solution and this one, but am still stuck. Any guidance would be much appreciated.

(updated original question)

Asked By: btroppo

Answers:

You can replace the duplicated values with NaN without removing the entire rows.

import numpy as np

# df: your dataframe
for c_name in df.columns:
    # True for every repeat of an earlier value in this column
    duplicated = df.duplicated(c_name)
    df.loc[duplicated, c_name] = np.nan

print(df)

I referred to this.

Answered By: HSL

Try this:

result = df.mask(df.apply(pd.Series.duplicated))
print(result)
>>>
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
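
The question mentions a subset of columns, and the same idea can be restricted to one. This is a minimal sketch, assuming the sample frame from the question; the name `cols` is only illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "yes", "bar"],
    "B": ["g", "g", "y", "y"],
    "C": ["A", "G", "B", "B"],
})

# Hypothetical subset of columns to de-duplicate; column C is left untouched.
cols = ["A", "B"]

result = df.copy()
# Within each chosen column, replace every repeat of an earlier value with NaN.
result[cols] = df[cols].mask(df[cols].apply(pd.Series.duplicated))
print(result)
```

Working on a copy leaves the original `df` unchanged, which makes it easier to compare before and after.
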
Answered By: ziying35

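Go through the code below. It shows that where and mask are complementary: df.where(cond) keeps the values where cond is True and replaces the rest with NaN, while df.mask(cond) replaces the values where cond is True. So df.where(~aa) and df.mask(aa) give the same result.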

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'A': ['foo', 'foo', 'yes', 'bar'],
    'B': ['g', 'g', 'y', 'y'],
    'C': ['A', 'G', 'B', 'B'],
})
print(df)
"""
     A  B  C
0  foo  g  A
1  foo  g  G
2  yes  y  B
3  bar  y  B

"""

aa = df.apply(pd.Series.duplicated)
print(aa)
"""
       A      B      C
0  False  False  False
1   True   True  False
2  False  False  False
3  False   True   True
"""
using_where = df.where(~aa)
print(using_where)
"""
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

"""
using_mask = df.mask(aa)
print(using_mask)

"""
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
"""
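
Because where keeps values where the condition holds and mask replaces them, the two calls can be checked against each other directly. This is a short sketch, assuming the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "yes", "bar"],
    "B": ["g", "g", "y", "y"],
    "C": ["A", "G", "B", "B"],
})

# Boolean frame: True where a value repeats an earlier value in its column.
aa = df.apply(pd.Series.duplicated)

using_where = df.where(~aa)  # keep values that are NOT duplicates
using_mask = df.mask(aa)     # replace values that ARE duplicates

# DataFrame.equals treats NaNs in the same position as equal.
print(using_where.equals(using_mask))
```
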
Answered By: Soudipta Dutta