How to reverse dummy variables in a pandas DataFrame
Question:
I would like to reverse a dataframe with dummy variables. For example,
from df_input:
Course_01 Course_02 Course_03
0 0 1
1 0 0
0 1 0
To df_output
Course
0 03
1 01
2 02
I have been looking at the solution provided at "Reconstruct a categorical variable from dummies in pandas", but it did not work. Any help would be much appreciated.
Many Thanks,
Best Regards,
Carlo
Answers:
Suppose you have the following dummy DF:
In [152]: d
Out[152]:
id T_30 T_40 T_50
0 id1 0 1 1
1 id2 1 0 1
We can prepare the following helper Series, stripping every non-digit character from the column names:
In [153]: v = pd.Series(d.columns.drop('id').str.replace(r'\D', '', regex=True).astype(int), index=d.columns.drop('id'))
In [155]: v
Out[155]:
T_30 30
T_40 40
T_50 50
dtype: int64
Now we can multiply them, stack and filter:
In [154]: d.set_index('id').mul(v).stack().reset_index(name='T').drop(columns='level_1').query("T > 0")
Out[154]:
id T
1 id1 40
2 id1 50
3 id2 30
5 id2 50
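For convenience, the IPython session above can be collected into one script. This is only a sketch: the frame `d` is rebuilt from the printed output, and the name `dummy_cols` is introduced here for readability.

```python
import pandas as pd

# Sample data rebuilt from the printed frame above; a row may hold several 1s.
d = pd.DataFrame({
    'id': ['id1', 'id2'],
    'T_30': [0, 1],
    'T_40': [1, 0],
    'T_50': [1, 1],
})

dummy_cols = d.columns.drop('id')
# Map each dummy column to the number in its name, e.g. 'T_30' -> 30.
v = pd.Series(
    dummy_cols.str.replace(r'\D', '', regex=True).astype(int).to_numpy(),
    index=dummy_cols,
)

out = (d.set_index('id')
         .mul(v)                      # every 1 becomes its column's number
         .stack()                     # wide -> long, one row per (id, column)
         .reset_index(name='T')
         .drop(columns='level_1')     # the original column label is no longer needed
         .query('T > 0'))             # keep only the dummies that were set
print(out)
```

Because the zeros are filtered out only at the end, this also works when a row has more than one dummy set, as in the sample data.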
We can use wide_to_long, then select the rows that are not equal to zero, i.e.:
ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')
T_
id T
id1 30 0
id2 30 1
id1 40 1
id2 40 0
not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')
id T
0 id2 30
1 id1 40
Update based on your edit:
ndf = pd.wide_to_long(df.reset_index(), stubnames='T_',i='index',j='T')
not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')
T
index
1 30
0 40
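A self-contained version of the first wide_to_long variant might look like the sketch below. The input frame is an assumption inferred from the printed output (columns id, T_30 and T_40), and `pairs` is a helper name introduced here only to show the result independent of row order.

```python
import pandas as pd

# Assumed input, inferred from the printed wide_to_long result above.
df = pd.DataFrame({'id': ['id1', 'id2'], 'T_30': [0, 1], 'T_40': [1, 0]})

# Columns starting with the stub 'T_' are melted; the numeric suffix goes to 'T'.
ndf = pd.wide_to_long(df, stubnames='T_', i='id', j='T')

# Keep only the rows whose dummy is set, then drop the dummy value itself.
not_dummy = (ndf[ndf['T_'].ne(0)]
             .reset_index()
             .drop(columns='T_'))

# (id, T) pairs, sorted so the result does not depend on row order.
pairs = sorted(zip(not_dummy['id'], not_dummy['T'].astype(int)))
print(pairs)
```

Note that wide_to_long converts numeric suffixes to numbers, so zero-padded codes such as 01 would need to be re-padded afterwards if the leading zero matters.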
You can use:
#create id to index if necessary
df = df.set_index('id')
#create MultiIndex
df.columns = df.columns.str.split('_', expand=True)
#reshape by stack and remove 0 rows
df = df.stack().reset_index().query('T != 0').drop(columns='T').rename(columns={'level_1':'T'})
print (df)
id T
1 id1 40
2 id2 30
EDIT:
col_name = 'Course'
df.columns = df.columns.str.split('_', expand=True)
df = (df.replace(0, np.nan)
        .stack()
        .dropna()  # stack() may keep the NaN rows, depending on the pandas version
        .reset_index()
        .drop(columns=[col_name, 'level_0'])
        .rename(columns={'level_1': col_name})
     )
print (df)
Course
0 03
1 01
2 02
I think melt() was pretty much made for this?
Your data, I think:
df_input = pd.DataFrame.from_dict({'Course_01':[0,1,0],
                                   'Course_02':[0,0,1],
                                   'Course_03':[1,0,0]})
Change names to match your desired output:
df_input.columns = df_input.columns.str.replace('Course_','')
Melt the dataframe:
dataMelted = pd.melt(df_input,
                     var_name='Course',
                     ignore_index=False)
Clean up zeros, etc:
df_output = (dataMelted[dataMelted['value'] != 0]
             .drop('value', axis=1)
             .sort_index())
>>> df_output
Course
0 03
1 01
2 02
# Create a new column for the categorical
df['categ'] = None
for i in df.index:
    if df.loc[i, 'Course_01'] == 1:
        df.loc[i, 'categ'] = '01'
    if df.loc[i, 'Course_02'] == 1:
        df.loc[i, 'categ'] = '02'
    if df.loc[i, 'Course_03'] == 1:
        df.loc[i, 'categ'] = '03'
df['categ'] = df['categ'].astype('category')
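The row-by-row loop can probably be replaced by a vectorized lookup. The sketch below swaps in idxmax, which is not what the answer above uses, and it assumes that every row has exactly one dummy set to 1. The sample frame is rebuilt from the question, and `course_cols` is a name introduced here.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'Course_01': [0, 1, 0],
                   'Course_02': [0, 0, 1],
                   'Course_03': [1, 0, 0]})

course_cols = ['Course_01', 'Course_02', 'Course_03']

# Assumption: exactly one 1 per row; otherwise idxmax silently picks the first maximum.
if not (df[course_cols].sum(axis=1) == 1).all():
    raise ValueError('expected exactly one dummy set per row')

df['categ'] = (df[course_cols]
               .idxmax(axis=1)                           # label of the column holding the 1
               .str.replace('Course_', '', regex=False)  # 'Course_03' -> '03'
               .astype('category'))
print(df['categ'].tolist())
```

The explicit check is there because idxmax would also return a label for a row of all zeros, which would be misleading.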