I wish to condense my dataset. Essentially it is a groupby.
Data:

id   box     status
aa   box11   hey
aa   box11   hey
aa   box11   hey
aa   box11   hey
aa   box5    hello
aa   box5    hello
aa   box5    hello
aa   box5    hello
aa   box5    hello
bb   box8    no
bb   box8    no

Desired:

id   box     status
aa   box11   hey
aa   box5    hello
bb   box8    no
My attempt so far:
df1 = df.groupby(["id"])["box"].agg()
Rather than a groupby, you can use drop_duplicates(). If you want to be careful and exclude "id" from the comparison, you can use the subset keyword:
df1 = df.drop_duplicates(subset = ['box', 'status'])
To clarify, drop_duplicates() only drops a row if it duplicates an earlier row in every column being compared; by default that is the full row. The subset keyword just tells it which columns to consider. If you had a row where box='box8' and status='hey', it would not be dropped: each value appears elsewhere on its own, but the combination is unique.
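As a rough illustration of the behaviour described above, here is a minimal sketch that rebuilds the sample data from the question and runs both calls. The extra row (id 'bb', box 'box8', status 'hey') is a made-up addition used only to show that a unique combination survives; the names rows, extra, df2 and df3 are not from the original post.

```python
import pandas as pd

# Rebuild the sample data from the question: 4 + 5 + 2 rows.
rows = (
    [("aa", "box11", "hey")] * 4
    + [("aa", "box5", "hello")] * 5
    + [("bb", "box8", "no")] * 2
)
df = pd.DataFrame(rows, columns=["id", "box", "status"])

# Default behaviour: a row is dropped only if all columns match an earlier row.
df1 = df.drop_duplicates().reset_index(drop=True)
print(df1)

# Hypothetical extra row: 'box8' and 'hey' each appear elsewhere,
# but never together, so the row is kept.
extra = pd.DataFrame([("bb", "box8", "hey")], columns=df.columns)
df2 = pd.concat([df, extra], ignore_index=True)

# subset restricts the comparison to the listed columns ("id" is ignored).
df3 = df2.drop_duplicates(subset=["box", "status"]).reset_index(drop=True)
print(df3)
```

With this data, df1 should contain the three rows from the desired output, and df3 should contain those three plus the added box8/hey row. By default the first occurrence of each duplicate is the one that is kept.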