Pandas dataframe: change unique values in each column to NaNs

Question:

I have a DataFrame arranged in a manner similar to this:

ID   Sample_1   Sample_2
A    0.182      0.754
B    0.182      0.754
C    0.182      0.01
D    0.182      0.2
E    0.9        0.2

As you can see, there are some repeated values ("true" measurements) and single values (that are actually "bad" measurements). What I need to do is to replace all unique values (that are so-called "bad") with NAs. This needs to be done for all columns.

In other words, the final dataframe should look like this:

ID   Sample_1   Sample_2
A    0.182      0.754
B    0.182      0.754
C    0.182      NaN
D    0.182      0.2
E    NaN        0.2

A possible solution I’ve thought about involves groupby and filter to get the index values (like in Get indexes of unique values in column (pandas)) and then replace the values, but the issue is that it works only for one column at a time:

unique_loc = df.groupby("Sample_1").filter(lambda x: len(x) == 1).index
df.loc[unique_loc, "Sample_1"] = np.nan

This means it would need to get repeated for many columns (and I have many in the actual data). Is there a more efficient solution?

Asked By: Einar

||

Answers:

Use DataFrame.where for set NaNs with mask created by Series.duplicated per columns in DataFrame.apply, for selecting all columns without first is used DataFrame.iloc:

df.iloc[:, 1:]=df.iloc[:, 1:].where(df.iloc[:, 1:].apply(lambda x:x.duplicated(keep=False)))
print (df)
  ID  Sample_1  Sample_2
0  A     0.182     0.754
1  B     0.182     0.754
2  C     0.182       NaN
3  D     0.182     0.200
4  E       NaN     0.200

If ID is index:

df = df.set_index('ID')

df = df.where(df.apply(lambda x: x.duplicated(keep=False)))
print (df)
    Sample_1  Sample_2
ID                    
A      0.182     0.754
B      0.182     0.754
C      0.182       NaN
D      0.182     0.200
E        NaN     0.200
Answered By: jezrael

One option is to get the values via np.unique, and replace them with pd.DataFrame.mask:

arr, counts = np.unique(df.filter(like='Sample').to_numpy(), return_counts=True)
df.mask(df.isin(arr[counts==1]))

  ID  Sample_1  Sample_2
0  A     0.182     0.754
1  B     0.182     0.754
2  C     0.182       NaN
3  D     0.182     0.200
4  E       NaN     0.200

Answered By: sammywemmy
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.