Pandas merge columns with similar prefixes
Question:
I have a pandas dataframe with binary columns that looks like this:
DEM_HEALTH_PRIV  DEM_HEALTH_PRE  DEM_HEALTH_HOS  DEM_HEALTH_OUT
              0               1               0               0
              0               0               1               1
I want to take the suffix of each variable and convert the binary variables into one categorical variable named after the prefix. For example, merge all DEM_HEALTH variables into a single column containing a list of the suffixes ("PRE", "HOS", "OUT", etc.) whose column value is equal to 1.
Output
DEM_HEALTH_PRIV
['PRE']
['HOS','OUT']
Any help would be much appreciated!
Answers:
Try this –
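# original dataframe is called df
import pandas as pd

new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
# the outer parentheses let the method chain span several lines
data = (df[df == 1]
        .stack()
        .dropna(how='all')  # no-op on older pandas; newer versions keep all-NaN rows after stack()
        .reset_index(-1)
        .groupby(level=0)['level_1']
        .apply(list))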
Explanation
If I understand correctly, your data looks something like the following:
print(df)
   DEM_HEALTH_PRIV  DEM_HEALTH_OUT  DEM_HEALTH_PRE  DEM_HEALTH_HOS
0                0               1               1               1
1                0               1               0               0
2                0               0               1               0
3                0               1               0               0
4                1               0               0               0
5                0               0               1               1
6                1               0               1               0
7                1               0               0               1
8                0               1               0               0
9                0               1               1               0
1. Create multi-index by rsplit
The first step is to rsplit (reverse split) each column name at the last occurrence of "_", so that only the suffix is separated from the rest of the name. Then create a multi-index in which DEM_HEALTH is level 0 and PRIV, PRE, HOS, etc. are level 1.
new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
print(df)
  DEM_HEALTH
        PRIV OUT PRE HOS
0          0   1   1   1
1          0   1   0   0
2          0   0   1   0
3          0   1   0   0
4          1   0   0   0
5          0   0   1   1
6          1   0   1   0
7          1   0   0   1
8          0   1   0   0
9          0   1   1   0
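To see why rsplit with a limit of 1 is used here rather than a plain split, the following is a minimal sketch that works on the column names alone. The variable names (`columns`, `pairs`, `multi`, `prefixes`, `suffixes`) are illustrative only and do not appear in the answer above.

```python
import pandas as pd

columns = ["DEM_HEALTH_PRIV", "DEM_HEALTH_OUT", "DEM_HEALTH_PRE", "DEM_HEALTH_HOS"]

# split() cuts at the FIRST underscore, which breaks the prefix apart
from_left = "DEM_HEALTH_PRIV".split("_", 1)    # ['DEM', 'HEALTH_PRIV']

# rsplit() with maxsplit=1 cuts at the LAST underscore only
from_right = "DEM_HEALTH_PRIV".rsplit("_", 1)  # ['DEM_HEALTH', 'PRIV']

# One (prefix, suffix) tuple per column, as in step 1
pairs = [tuple(name.rsplit("_", 1)) for name in columns]
multi = pd.MultiIndex.from_tuples(pairs)

prefixes = list(multi.get_level_values(0).unique())  # level 0: the shared prefix
suffixes = list(multi.get_level_values(1))           # level 1: one suffix per column
```

Note that this relies on every column name containing at least one underscore; a name without one would yield a one-element tuple and would not fit a two-level index.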
2. Stack and Groupby over level=0
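data = (df[df == 1]                    # keep the 1s; every other cell becomes NaN
        .stack()                       # move the suffix level from the columns into the row index
        .dropna(how='all')             # no-op on older pandas; newer versions keep all-NaN rows after stack()
        .reset_index(-1)               # turn the suffix level into a column named 'level_1'
        .groupby(level=0)['level_1']   # group the suffixes by original row label
        .apply(list))                  # collect each row's suffixes into a list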
0 [HOS, OUT, PRE]
1 [OUT]
2 [PRE]
3 [OUT]
4 [PRIV]
5 [HOS, PRE]
6 [PRE, PRIV]
7 [HOS, PRIV]
8 [OUT]
9 [OUT, PRE]
Name: level_1, dtype: object
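For reference, here is a self-contained sketch of the same stack-and-groupby idea applied to the two-row sample from the question. It is a variation rather than the exact code above: it selects the single DEM_HEALTH prefix first, calls dropna() explicitly so that it does not depend on the stack() defaults of a particular pandas version, and sorts each list so the order is deterministic. The names `flags`, `stacked` and `suffix_lists`, and the index labels "row" and "suffix", are assumptions made for the illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "DEM_HEALTH_PRIV": [0, 0],
        "DEM_HEALTH_PRE": [1, 0],
        "DEM_HEALTH_HOS": [0, 1],
        "DEM_HEALTH_OUT": [0, 1],
    }
)

# Step 1: split each name at the last underscore and build a two-level column index
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.rsplit("_", 1)) for c in df.columns]
)

# Select one prefix; the remaining columns are just the suffixes
flags = df["DEM_HEALTH"]

# Step 2: mask everything that is not 1, stack the suffixes into the row index,
# and drop the masked (NaN) entries explicitly
stacked = (
    flags.where(flags == 1)
    .stack()
    .dropna()
    .rename_axis(["row", "suffix"])
)

# Collect the surviving suffixes per original row, sorted alphabetically
suffix_lists = (
    stacked.reset_index("suffix")
    .groupby(level=0)["suffix"]
    .apply(sorted)
)

print(suffix_lists)
```

Rows in which no column equals 1 do not appear in the result at all, so if every original row is needed, reindexing against `df.index` afterwards is one option.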