Combine multiple columns into one category column using the column names as value label
Question:
I have this data
ID A B C
0 0 True False False
1 1 False True False
2 2 False False True
And want to transform it into
ID group
0 0 A
1 1 B
2 2 C
- I want to use the column names as value labels for the
category
column.
- There is a maximum of only one
True
value per row.
This is the MWE
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
'ID': range(3),
'A': [True, False, False],
'B': [False, True, False],
'C': [False, False, True]
})
result = pd.DataFrame({
'ID': range(3),
'group': ['A', 'B', 'C']
})
result.group = result.group.astype('category')
print(df)
print(result)
I could do df.apply(lambda row: ...magic.., axis=1)
. But isn’t there a more elegant way with pandas’ own tools?
Answers:
You can use df.dot
:
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:])
You can use melt then a lookup based on the column where the values are true to get the results you are expecting
df = df.melt(id_vars = 'ID', var_name = 'group')
df.loc[df['value'] == True][['ID', 'group']]
You could use pd.melt()
to reshape and rename, then boolean filtering on the ‘value’ column using query
:
pd.melt(df,id_vars=['ID'],var_name= 'group').query('value')
ID group value
0 0 A True
4 1 B True
8 2 C True
Chaining .drop('value',axis=1).reset_index(drop=True)
will give your final output:
ID group
0 0 A
1 1 B
2 2 C
Yet another way:
df.set_index(['ID'])
.rename_axis('group', axis=1) # getting column name correct
.stack() # reshaping getting column headers into dataframe rows
.loc[lambda x: x] # filtering for True
.reset_index() # moving ID back into dataframe columns
.drop(0, axis=1) # dropping boolean column
Output:
ID group
0 0 A
1 1 B
2 2 C
More verbose than melt
, but this drops the invalid columns during the reshaping:
(df.set_index('ID')
.rename_axis(columns='group')
.replace(False, pd.NA)
.stack().reset_index().drop(columns=0)
)
output:
ID group
0 0 A
1 1 B
2 2 C
Try with apply
lambda
df.set_index('ID').apply(lambda x : x.index[x][0],axis=1)
Out[39]:
ID
0 A
1 B
2 C
dtype: object
idxmax
s = df.set_index('ID')
s.idxmax(1).where(s.any(1))
ID
0 A
1 B
2 C
dtype: object
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:])
I have this data
ID A B C
0 0 True False False
1 1 False True False
2 2 False False True
And want to transform it into
ID group
0 0 A
1 1 B
2 2 C
- I want to use the column names as value labels for the
category
column. - There is a maximum of only one
True
value per row.
This is the MWE
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
'ID': range(3),
'A': [True, False, False],
'B': [False, True, False],
'C': [False, False, True]
})
result = pd.DataFrame({
'ID': range(3),
'group': ['A', 'B', 'C']
})
result.group = result.group.astype('category')
print(df)
print(result)
I could do df.apply(lambda row: ...magic.., axis=1)
. But isn’t there a more elegant way with pandas’ own tools?
You can use df.dot
:
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:])
You can use melt then a lookup based on the column where the values are true to get the results you are expecting
df = df.melt(id_vars = 'ID', var_name = 'group')
df.loc[df['value'] == True][['ID', 'group']]
You could use pd.melt()
to reshape and rename, then boolean filtering on the ‘value’ column using query
:
pd.melt(df,id_vars=['ID'],var_name= 'group').query('value')
ID group value
0 0 A True
4 1 B True
8 2 C True
Chaining .drop('value',axis=1).reset_index(drop=True)
will give your final output:
ID group
0 0 A
1 1 B
2 2 C
Yet another way:
df.set_index(['ID'])
.rename_axis('group', axis=1) # getting column name correct
.stack() # reshaping getting column headers into dataframe rows
.loc[lambda x: x] # filtering for True
.reset_index() # moving ID back into dataframe columns
.drop(0, axis=1) # dropping boolean column
Output:
ID group
0 0 A
1 1 B
2 2 C
More verbose than melt
, but this drops the invalid columns during the reshaping:
(df.set_index('ID')
.rename_axis(columns='group')
.replace(False, pd.NA)
.stack().reset_index().drop(columns=0)
)
output:
ID group
0 0 A
1 1 B
2 2 C
Try with apply
lambda
df.set_index('ID').apply(lambda x : x.index[x][0],axis=1)
Out[39]:
ID
0 A
1 B
2 C
dtype: object
idxmax
s = df.set_index('ID')
s.idxmax(1).where(s.any(1))
ID
0 A
1 B
2 C
dtype: object
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:])