Get unique values and column names from a data frame
Question:
I have a data frame with the following columns:
col1 col2 col3
a b b
c d e
e a b
I need to make a new data frame with the unique values and the corresponding column names (keeping a de-duplicated list of column names when a value occurs in multiple columns). The output would be:
name col_name
a [col1, col2]
b [col2, col3]
c [col1]
d [col2]
e [col1, col3]
How can I construct this from the given data frame?
Answers:
Use DataFrame.melt, remove duplicates with DataFrame.drop_duplicates, and then aggregate the column names into lists:
df1 = (df.melt(value_name='name', var_name='col_name')
         .drop_duplicates()
         .groupby('name')['col_name']
         .agg(list)
         .reset_index())
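For a version that can be run as-is, here is a minimal sketch of the same chain. The construction of `df` is an assumption, reconstructed from the sample table in the question:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
df = pd.DataFrame({
    'col1': ['a', 'c', 'e'],
    'col2': ['b', 'd', 'a'],
    'col3': ['b', 'e', 'b'],
})

# Reshape to long format: one row per (column name, value) pair.
long_df = df.melt(value_name='name', var_name='col_name')

# Drop repeated (column name, value) pairs, then collect the
# column names of each value into a list.
df1 = (long_df.drop_duplicates()
              .groupby('name')['col_name']
              .agg(list)
              .reset_index())

print(df1)
```

Here `b` occurs twice in `col3`, so without `drop_duplicates` its list would contain `col3` twice.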
Or remove duplicates with the dict.fromkeys trick, which keeps the column names in order of first appearance, if ordering is important:
df1 = (df.melt(value_name='name')
         .groupby('name')['variable']
         .agg(lambda x: list(dict.fromkeys(x)))
         .reset_index(name='col_name'))
print(df1)
name col_name
0 a [col1, col2]
1 b [col2, col3]
2 c [col1]
3 d [col2]
4 e [col1, col3]
If order is not important, use sets (the order of the names within each list may then vary):
df2 = (df.melt(value_name='name', var_name='col_name')
         .groupby('name')['col_name']
         .agg(lambda x: list(set(x)))
         .reset_index())
print(df2)
name col_name
0 a [col1, col2]
1 b [col3, col2]
2 c [col1]
3 d [col2]
4 e [col1, col3]
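If the set-based variant is preferred but a reproducible result is still wanted, one option is to sort each set. This is only a sketch: `df3` is a name introduced here, and the construction of `df` is again assumed from the question's sample table. Note that it gives alphabetical order, not the original column order:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
df = pd.DataFrame({
    'col1': ['a', 'c', 'e'],
    'col2': ['b', 'd', 'a'],
    'col3': ['b', 'e', 'b'],
})

# set() removes duplicates; sorted() turns the set into a list
# with a deterministic (alphabetical) order.
df3 = (df.melt(value_name='name', var_name='col_name')
         .groupby('name')['col_name']
         .agg(lambda x: sorted(set(x)))
         .reset_index())

print(df3)
```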