Get unique values and column names from a data frame

Question:

I have a data frame with the following columns:

col1    col2    col3
a       b       b
c       d       e
e       a       b

I need to build a new data frame that lists each unique value together with the names of the columns it appears in (a list of unique column names when a value occurs in several columns). The expected output is:

name    col_name
a       [col1, col2]
b       [col2, col3]
c       [col1]
d       [col2]
e       [col1, col3]

How can I construct this from the given data frame?

Asked By: S_S


Answers:

Use DataFrame.melt to reshape the frame to long format, remove duplicate (column, value) pairs with DataFrame.drop_duplicates, then group by value and aggregate the column names into lists:

df1 = (df.melt(value_name='name', var_name='col_name')
        .drop_duplicates()
        .groupby('name')['col_name']
        .agg(list)
        .reset_index())
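
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption based on the sample table in the question; the chain itself is the same as above.

```python
import pandas as pd

# Sample data rebuilt from the table in the question (assumed to be plain strings).
df = pd.DataFrame({
    "col1": ["a", "c", "e"],
    "col2": ["b", "d", "a"],
    "col3": ["b", "e", "b"],
})

df1 = (
    df.melt(value_name="name", var_name="col_name")  # long format: one row per cell
      .drop_duplicates()                             # 'b' occurs twice in col3
      .groupby("name")["col_name"]
      .agg(list)                                     # collect column names per value
      .reset_index()
)
print(df1)
```

Without drop_duplicates, col3 would be listed twice for b, because b appears in two rows of that column.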

Alternatively, remove the duplicates with the dict.fromkeys trick, which keeps the column names in their order of first appearance:

df1 = (df.melt(value_name='name')
        .groupby('name')['variable']
        .agg(lambda x: list(dict.fromkeys(x)))
        .reset_index(name='col_name'))

print(df1)
  name            col_name
0    a        [col1, col2]
1    b        [col2, col3]
2    c              [col1]
3    d              [col2]
4    e        [col1, col3]
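
The dict.fromkeys trick works because dict keys are unique and, since Python 3.7, keep insertion order. A small standalone sketch, using a made-up list of column names rather than the question's data:

```python
# Hypothetical column names for a single value, with repeats.
cols = ["col3", "col1", "col3", "col2", "col1"]

# Dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
ordered_unique = list(dict.fromkeys(cols))

# A set also removes duplicates, but its iteration order is arbitrary.
unordered_unique = list(set(cols))

print(ordered_unique)  # ['col3', 'col1', 'col2']
```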

If the order of the column names does not matter, use sets (the order within each list is then arbitrary, as for b in the output):

df2 = (df.melt(value_name='name', var_name='col_name')
        .groupby('name')['col_name']
        .agg(lambda x: list(set(x)))
        .reset_index())

print(df2)
  name      col_name
0    a  [col1, col2]
1    b  [col3, col2]
2    c        [col1]
3    d        [col2]
4    e  [col1, col3]
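
If you want set-based deduplication but a reproducible result, one option (a variation, not part of the original answer) is to sort each set. The sketch below again assumes the sample frame from the question:

```python
import pandas as pd

# Sample data rebuilt from the table in the question (assumed to be plain strings).
df = pd.DataFrame({
    "col1": ["a", "c", "e"],
    "col2": ["b", "d", "a"],
    "col3": ["b", "e", "b"],
})

df2 = (
    df.melt(value_name="name", var_name="col_name")
      .groupby("name")["col_name"]
      .agg(lambda x: sorted(set(x)))  # set removes duplicates, sorted fixes the order
      .reset_index()
)
print(df2)
```

Sorting gives alphabetical order, which here happens to match the column order of the frame; that would not hold in general for columns that are not named in sorted order.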
Answered By: jezrael