Remove duplicate values across columns in pandas dataframe, without removing entire row
Question:
I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row.
Dataframe:
A B C
0 foo g A
1 foo g G
2 yes y B
3 bar y B
Desired result:
A B C
0 foo g A
1 NaN NaN G
2 yes y B
3 bar NaN NaN
I have tried drop_duplicates() by splitting the data into separate data frames by column and then re-appending them, but this had its own issues.
I have also tried this solution and this one, but I am still stuck. Any guidance would be much appreciated.
(updated original question)
Answers:
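Without removing entire rows, you can replace the duplicated values in each column with NaN.
import numpy as np

# df: your dataframe
for c_name in df.columns:
    # True for every repeat of a value already seen in this column
    duplicated = df.duplicated(c_name)
    # np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
    df.loc[duplicated, c_name] = np.nan
print(df)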
I referred to this.
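The loop above keeps the first occurrence of each value and changes df in place. As a minimal sketch, assuming you might want to keep the last occurrence or blank every repeated value instead, the keep argument of Series.duplicated can be passed through. The helper name blank_duplicates is hypothetical, and it uses Series.mask on a copy rather than the in-place .loc assignment from the answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "yes", "bar"],
    "B": ["g", "g", "y", "y"],
    "C": ["A", "G", "B", "B"],
})

def blank_duplicates(frame, keep="first"):
    """Hypothetical helper: return a copy with per-column duplicates set to NaN.

    keep is forwarded to Series.duplicated: "first", "last", or False
    (False marks every occurrence of a repeated value).
    """
    out = frame.copy()
    for c_name in out.columns:
        dup = out[c_name].duplicated(keep=keep)
        # mask() replaces the entries where dup is True with NaN
        out[c_name] = out[c_name].mask(dup)
    return out

keep_last = blank_duplicates(df, keep="last")
drop_all = blank_duplicates(df, keep=False)
print(keep_last)
print(drop_all)
```

Because the helper works on a copy, the original df is left unchanged.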
Try this:
result = df.mask(df.apply(pd.Series.duplicated))
print(result)
>>>
A B C
0 foo g A
1 NaN NaN G
2 yes y B
3 bar NaN NaN
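The question mentions a subset of columns, while the one-liner above handles every column. A possible sketch for restricting it, assuming (for illustration only) that just columns A and B should be de-duplicated and C left alone:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "yes", "bar"],
    "B": ["g", "g", "y", "y"],
    "C": ["A", "G", "B", "B"],
})

# Assumed subset for illustration; replace with your own column names.
subset = ["A", "B"]

result = df.copy()
# Flag duplicates column by column, but only within the chosen subset,
# then blank the flagged cells and write them back.
dup_flags = df[subset].apply(pd.Series.duplicated)
result[subset] = df[subset].mask(dup_flags)
print(result)
```

Here column C keeps both of its B values because it is not part of the subset.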
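Walk through the code below to see how mask and where relate. where keeps the values where the condition is True and replaces the rest with NaN; mask does the opposite and replaces the values where the condition is True. So df.where(~aa) and df.mask(aa) give the same result.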
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['A','B','C'])
df['A'] = ['foo','foo', 'yes','bar' ]
df['B'] = ['g','g', 'y', 'y']
df['C'] = ['A','G','B','B']
print(df)
"""
A B C
0 foo g A
1 foo g G
2 yes y B
3 bar y B
"""
aa = df.apply(pd.Series.duplicated)
print(aa)
"""
A B C
0 False False False
1 True True False
2 False False False
3 False True True
"""
using_where = df.where(~aa)
print(using_where)
"""
A B C
0 foo g A
1 NaN NaN G
2 yes y B
3 bar NaN NaN
"""
using_mask = df.mask(aa)
print(using_mask)
"""
A B C
0 foo g A
1 NaN NaN G
2 yes y B
3 bar NaN NaN
"""
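The two printed frames look the same. A short check, sketched here with the same data, confirms it with DataFrame.equals, which treats NaN values in the same position as equal:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "yes", "bar"],
    "B": ["g", "g", "y", "y"],
    "C": ["A", "G", "B", "B"],
})

# Boolean frame: True where a value repeats an earlier one in its column
aa = df.apply(pd.Series.duplicated)

using_where = df.where(~aa)  # keep cells where the condition is True
using_mask = df.mask(aa)     # replace cells where the condition is True

same = using_where.equals(using_mask)
print(same)
```

Which of the two to use is mostly a matter of which condition is easier to read: where when you have a "keep" condition, mask when you have a "replace" condition.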