Pandas merge columns with similar prefixes

Question:

I have a pandas dataframe with binary columns that looks like this:

DEM_HEALTH_PRIV  DEM_HEALTH_PRE  DEM_HEALTH_HOS  DEM_HEALTH_OUT
0                        1             0              0
0                        0             1              1

I want to take the suffix of each variable and convert the binary variables into one categorical variable named after the shared prefix. For example, merge all DEM_HEALTH variables into a single column holding a list of the suffixes ("PRE", "HOS", "OUT", etc.) whose column value equals 1.

Desired output:

DEM_HEALTH
['PRE']
['HOS','OUT']

Any help would be much appreciated!

Asked By: user14140004


Answers:

Try this –

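import pandas as pd

# original dataframe is called df

# Split each column name at the last underscore into (prefix, suffix)
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols

# The parentheses let the method chain span several lines
data = (
    df[df == 1]                     # keep the 1s, everything else becomes NaN
    .stack()                        # move suffixes into the row index; relies on stack() dropping NaN
    .reset_index(-1)                # the suffix level becomes the column 'level_1'
    .groupby(level=0)['level_1']    # group by the original row label
    .apply(list)                    # collect the suffixes of each row into a list
)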

Explanation

If I understand correctly, your data looks something like the following:

print(df)

   DEM_HEALTH_PRIV  DEM_HEALTH_OUT  DEM_HEALTH_PRE  DEM_HEALTH_HOS
0                0               1               1               1
1                0               1               0               0
2                0               0               1               0
3                0               1               0               0
4                1               0               0               0
5                0               0               1               1
6                1               0               1               0
7                1               0               0               1
8                0               1               0               0
9                0               1               1               0
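
To follow along, the frame above can be rebuilt by hand. This is only a sketch: the values are copied from the printout, and the name df is the one the answer already assumes.

```python
import pandas as pd

# Values copied column by column from the printed frame above.
df = pd.DataFrame(
    {
        "DEM_HEALTH_PRIV": [0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
        "DEM_HEALTH_OUT":  [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
        "DEM_HEALTH_PRE":  [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
        "DEM_HEALTH_HOS":  [1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    }
)
print(df)
```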

1. Create multi-index by rsplit

The first step is to rsplit (reverse split) each column name at the last occurrence of the "_" substring, then build a multi-index from the pieces: DEM_HEALTH becomes level 0, and PRE, HOS, etc. become level 1.

new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)

df.columns = new_cols
print(df)
  DEM_HEALTH            
        PRIV OUT PRE HOS
0          0   1   1   1
1          0   1   0   0
2          0   0   1   0
3          0   1   0   0
4          1   0   0   0
5          0   0   1   1
6          1   0   1   0
7          1   0   0   1
8          0   1   0   0
9          0   1   1   0
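
To see why rsplit with a limit of 1 is used here rather than a plain split, here is a small illustration on a single column name. The variable names are only for demonstration.

```python
name = "DEM_HEALTH_PRIV"

# rsplit with maxsplit=1 cuts only at the last underscore,
# so the prefix keeps its inner underscore.
prefix, suffix = name.rsplit("_", 1)
print(prefix, suffix)     # DEM_HEALTH PRIV

# A plain split would break the prefix apart as well.
print(name.split("_"))    # ['DEM', 'HEALTH', 'PRIV']
```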

2. Stack and Groupby over level=0

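# The parentheses let the method chain span several lines
data = (
    df[df == 1]                     # keep the 1s, everything else becomes NaN
    .stack()                        # move suffixes into the row index; relies on stack() dropping NaN
    .reset_index(-1)                # the suffix level becomes the column 'level_1'
    .groupby(level=0)['level_1']    # group by the original row label
    .apply(list)                    # collect the suffixes of each row into a list
)
print(data)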

0    [HOS, OUT, PRE]
1              [OUT]
2              [PRE]
3              [OUT]
4             [PRIV]
5         [HOS, PRE]
6        [PRE, PRIV]
7        [HOS, PRIV]
8              [OUT]
9         [OUT, PRE]
Name: level_1, dtype: object
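
As an alternative that does not depend on how stack() treats NaN, the same result can be sketched with a plain loop over the rows. This is not the method used above, only a possible variant: collapse_binary_columns is a hypothetical helper name, and it assumes every column name contains at least one underscore. The sample frame is the two-row example from the question.

```python
import pandas as pd

# Two-row example from the question.
df = pd.DataFrame({
    "DEM_HEALTH_PRIV": [0, 0],
    "DEM_HEALTH_PRE":  [1, 0],
    "DEM_HEALTH_HOS":  [0, 1],
    "DEM_HEALTH_OUT":  [0, 1],
})

def collapse_binary_columns(frame):
    """Hypothetical helper: build one list-valued column per prefix."""
    # Map every column to its [prefix, suffix] pair, split at the last underscore.
    parts = {col: col.rsplit("_", 1) for col in frame.columns}
    out = {}
    # dict.fromkeys keeps the prefixes unique and in first-seen order.
    for prefix in dict.fromkeys(p for p, _ in parts.values()):
        cols = [c for c in frame.columns if parts[c][0] == prefix]
        # For each row, keep the suffixes whose flag equals 1.
        out[prefix] = [
            [parts[c][1] for c, flag in zip(cols, row) if flag == 1]
            for row in frame[cols].itertuples(index=False, name=None)
        ]
    return pd.DataFrame(out, index=frame.index)

result = collapse_binary_columns(df)
print(result["DEM_HEALTH"].tolist())   # [['PRE'], ['HOS', 'OUT']]
```

Unlike the stack-based version, this keeps the suffixes in column order rather than sorted order, and a row with no 1s yields an empty list instead of being dropped.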
Answered By: Akshay Sehgal