How to get unique sets of data in pandas

Question:

Imagine people were asked for their favorite color, and their answers were stored in a DataFrame (columns Person and Colors). Some people named only one color, others named several. I need to find the unique combinations (sets) of colors.

import pandas as pd

data_list=['black',
           'red',
           'red',
           'black',
           'red',
           'white',
           'blue',
           'blue',
           'purple',
           'black',
           'green',
           'white',
           'black',
           'blue',
           'red',
          ]
names_list=['Michael',
            'Michael',
            'Tony',
            'Tony',
            'Helen',
            'Margo',
            'Max',
            'John',
            'Roxanne',
            'Roxanne',
            'Roxanne',
            'Marla',
            'Jack',
            'Jack',
            'Jack',
           ]
df=pd.DataFrame(data=zip(names_list, data_list), columns=['Person', 'Colors'])

I tried grouping the colors accordingly:

df.groupby('Person').agg({'Colors': pd.unique})

The output is:

                         Colors
Person
Helen                     [red]
Jack         [black, blue, red]
John                     [blue]
Margo                   [white]
Marla                   [white]
Max                      [blue]
Michael            [black, red]
Roxanne  [purple, black, green]
Tony               [red, black]

So I need one final aggregation of the colors, and I have no idea how to achieve it. Another pass of pd.unique() doesn’t work because the per-person values are not hashable. Converting them to tuples doesn’t work either, because (black, red) is different from (red, black).
In the end I expect this result:

[red]
[black, blue, red]
[blue]
[white]
[black, red]
[purple, black, green]

(order doesn’t matter)
Then I’ll try to get some statistics like "how many people named only blue as their favorite and how many named red and black", but first things first: I need to get all unique combinations of colors.
Any advice appreciated.

Asked By: usualMortal

||

Answers:

You could make a new column with a joined string of all the colors:

df_grouped = df.groupby('Person').agg({'Colors': pd.unique})
# sort so that order doesn't matter; join with a separator so color names can't run together
df_grouped['Colors str'] = [','.join(sorted(color_list)) for color_list in df_grouped['Colors']]

# drop duplicate color combinations: 
df_grouped = df_grouped.drop_duplicates(subset=['Colors str'])

print(f"Unique color combinations:\n{df_grouped['Colors']}")
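
As an alternative to the helper string column, here is a minimal sketch that aggregates each person's colors into a frozenset, which is hashable and ignores order. It is not part of the approach above, just another way to get the same result. The DataFrame is a shortened subset of the question's data, and the names color_sets, unique_combinations and combination_counts are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Shortened subset of the question's data, for illustration only.
df = pd.DataFrame({
    'Person': ['Michael', 'Michael', 'Tony', 'Tony', 'Helen', 'Max', 'John'],
    'Colors': ['black', 'red', 'red', 'black', 'red', 'blue', 'blue'],
})

# One frozenset per person: hashable, and {black, red} == {red, black}.
color_sets = df.groupby('Person')['Colors'].agg(frozenset)

# Unique combinations across all people.
unique_combinations = set(color_sets)

# How many people named each combination (iterating a Series yields its values).
combination_counts = Counter(color_sets)

for combination, count in combination_counts.items():
    print(sorted(combination), count)
```

The counts also cover the follow-up statistics from the question: for example, combination_counts[frozenset({'blue'})] is the number of people who named only blue.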

Answered By: Zephyrus