How can I remove label conflicts in a classification problem?

Question:

I have identical samples with different labels, which happened because some of the data was mislabeled. Mislabeled data can confuse the model and lower its performance.

It’s a binary classification problem.
My input table looks something like this:

[Image: input table]

I want the table below as my cleaned data:

[Image: desired cleaned table]

I tried this data cleaning library to check for conflicts, but was not able to clean the data with it: https://docs.deepchecks.com/stable/checks_gallery/tabular/data_integrity/plot_conflicting_labels.html#

My custom function also takes a long time to run.
What is the most efficient way to do this when I have 2M records to clean?

Asked By: Shiv948

||

Answers:

You can use drop_duplicates with a subset:

# Keep the first row for each (A, B, C) combination and drop later rows with the same features
out = df.drop_duplicates(['A', 'B', 'C'], ignore_index=True)
print(out)

# Output
   A  B  C  Target
0  1  2  3       0
1  2  8  9       1
2  9  6  5       1
3  3  7  0       0
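
Note that drop_duplicates keeps the first row it sees for each feature combination, so which label survives depends on row order. If you would rather discard conflicting samples entirely, here is a minimal sketch, assuming the feature columns are A, B, C and the label column is Target as in the output above. The sample DataFrame is made up for illustration, because the original input table is not shown.

```python
import pandas as pd

# Hypothetical sample data: A, B, C are features, Target is the binary label.
df = pd.DataFrame({
    "A": [1, 1, 2, 9, 9, 3],
    "B": [2, 2, 8, 6, 6, 7],
    "C": [3, 3, 9, 5, 5, 0],
    "Target": [0, 1, 1, 1, 0, 0],
})
features = ["A", "B", "C"]

# Number of distinct labels for each feature combination, broadcast back to every row.
n_labels = df.groupby(features)["Target"].transform("nunique")

# Option 1: keep the first row of every feature combination (drop_duplicates with a subset).
keep_first = df.drop_duplicates(features, ignore_index=True)

# Option 2: discard every row whose feature combination carries more than one label,
# then remove the remaining exact duplicates.
no_conflicts = df[n_labels == 1].drop_duplicates(features, ignore_index=True)

print(keep_first)
print(no_conflicts)
```

Both options are vectorized, so they should scale to a couple of million rows far better than a row-by-row Python function.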
Answered By: Corralien