How to use np.where for creating multiple conditional columns?

Question:

I have a dataframe as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
    'Hours1': [4, 4, 8, 8],
    'Group2': ['Admin', 'Customer', '0', '0'],
    'Hours2': [4.0, 2.0, 0.0, 0.0],
    'Group3': ['0', 'Admin', '0', '0'],
    'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
0   1  Maintenance       4     Admin     4.0      0     0.0
1   2         Shop       4  Customer     2.0  Admin     2.0
2   3        Admin       8         0     0.0      0     0.0
3   4         Shop       8         0     0.0      0     0.0

I would like to create new columns as follows:

Desired output:

   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Maintenance  Seg_Shop  Seg_Admin  Seg_Customer
0   1  Maintenance       4     Admin     4.0      0     0.0             0.50      0.00       0.50          0.00
1   2         Shop       4  Customer     2.0  Admin     2.0             0.00      0.50       0.25          0.25
2   3        Admin       8         0     0.0      0     0.0             0.00      0.00       1.00          0.00
3   4         Shop       8         0     0.0      0     0.0             0.00      0.00       0.00          1.00

This is my code and its current output. I understand why it does not give me what I want, but I'm not sure how to modify it to get the desired output.

Code:

segment_list = ["Maintenance", "Shop", "Admin", "Customer"]

for i in segment_list:
    df["Seg_" + i] = np.where(
        (df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
        (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8,
        0)

Current output:

   ID       Group1  Hours1  ...  Seg_Maintenance  Seg_Shop  Seg_Admin  Seg_Customer
0   1  Maintenance       4  ...              1.0       0.0        1.0           0.0
1   2         Shop       4  ...              0.0       1.0        1.0           1.0
2   3        Admin       8  ...              0.0       0.0        1.0           0.0
3   4         Shop       8  ...              0.0       1.0        0.0           0.0

Asked By: user12715151


Answers:

Probably not the cleanest way, but it works; I couldn't come up with a more elegant approach.

# The '0' placeholders really represent missing data, so replace them with NaN first:
df = df.replace({'0': np.nan, 0: np.nan})

print(df)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0
# 2   3        Admin       8       NaN     NaN    NaN     NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN

df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We need the Hours column only, so just add it to df1
df1['Hours'] = df2['Hours']

# A lot of IDs will have NaN values for empty groups, so let's remove them.
df1 = df1.sort_values('ID').dropna()
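
Incidentally, the two melt calls plus the dropna can be collapsed into a single step with pd.lreshape, which reshapes both column groups at once (a variant sketch, not part of the original answer):

df1 = pd.lreshape(df, {'Group': ['Group1', 'Group2', 'Group3'],
                       'Hours': ['Hours1', 'Hours2', 'Hours3']})
# lreshape drops rows with NaN in the reshaped columns by default,
# so this matches df1 after the dropna above (row order aside).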

# Now we pivot, where the Groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')

# Calculate each group's share of the row's total hours.
pvt = pvt.apply(lambda r: r / r.sum(), axis=1).reset_index()

# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
#   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Admin  Customer  
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN   0.50       NaN   
# 1   2         Shop       4  Customer     2.0  Admin     2.0   0.25      0.25   
# 2   3        Admin       8       NaN     NaN    NaN     NaN   1.00       NaN   
# 3   4         Shop       8       NaN     NaN    NaN     NaN    NaN       NaN   

#    Maintenance  Shop  
# 0          0.5   NaN  
# 1          NaN   0.5  
# 2          NaN   NaN  
# 3          NaN   1.0  
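
As an aside, the row-wise apply above can be replaced by a vectorized division that computes the same shares without the Python-level loop (a minor variant, not part of the original answer):

pvt = pvt.div(pvt.sum(axis=1), axis=0).reset_index()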

Answered By: JarroVGIT

Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use (because of groupby and its handling of index and multi-index).

Edit: cleaner way using pd.wide_to_long.

First, some cleaning and reshaping:

z = pd.wide_to_long(
    df, stubnames=['Group', 'Hours'], 
    i='ID',
    j='k')
z = z.loc[~z['Group'].isin({'0', 0})].droplevel('k').set_index(
    'Group', append=True).sort_index()

>>> z
                Hours
ID Group             
1  Admin          4.0
   Maintenance    4.0
2  Admin          2.0
   Customer       2.0
   Shop           4.0
3  Admin          8.0
4  Shop           8.0
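
To make the call above concrete: pd.wide_to_long pairs each stubname with the numeric suffix of the column names, so 'Group1'/'Hours1' land under k=1, and the raw result still contains the '0' placeholder rows; that is what the isin filter removes. A small illustration (not part of the original answer):

raw = pd.wide_to_long(df, stubnames=['Group', 'Hours'], i='ID', j='k')

>>> raw.sort_index().loc[1]
         Group  Hours
k
1  Maintenance    4.0
2        Admin    4.0
3            0    0.0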

In a previous version of this answer, I got the same "long" result "by hand", as follows. Convert each column name ending with one or more digits, such as 'Foo726', into a tuple (prefix, digits), e.g. ('Foo', '726'). These tuples (one per column name) are converted into a MultiIndex, which becomes the new column axis. Then use stack and some index manipulations to get the clean, long dataframe:

import re

# set ID as index and clean up the '0' entries
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})

# then, convert 'Group1', ... into a MultiIndex [(Group, 1), (Hours, 1), ...]
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])

>>> ix
MultiIndex([('Group', '1'),
            ('Hours', '1'),
            ('Group', '2'),
            ('Hours', '2'),
            ('Group', '3'),
            ('Hours', '3')],
           )

# and convert to a long frame with ['ID', 'Group'] as index
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index(
    'Group', append=True)

Either way, we can now easily calculate the desired summaries (here, just one: the fraction of hours relative to each ID's total):

def normalize(g):
    return g / g.sum()

# add some summary stats (fraction of total)
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(normalize))

>>> z
                Hours   Seg
ID Group                   
1  Maintenance    4.0  0.50
   Admin          4.0  0.50
2  Shop           4.0  0.50
   Customer       2.0  0.25
   Admin          2.0  0.25
3  Admin          8.0  1.00
4  Shop           8.0  1.00
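
For what it's worth, the helper function is not required; the same Seg column can be computed with the built-in 'sum' transform (an equivalent one-liner, not part of the original answer):

z['Seg'] = z['Hours'] / z.groupby('ID')['Hours'].transform('sum')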

At this point, one could simply reshape to wide again, with MultiIndex columns:

>>> z.unstack('Group')
      Hours                             Seg                          
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID                                                                   
1       4.0      NaN         4.0  NaN  0.50      NaN         0.5  NaN
2       2.0      2.0         NaN  4.0  0.25     0.25         NaN  0.5
3       8.0      NaN         NaN  NaN  1.00      NaN         NaN  NaN
4       NaN      NaN         NaN  8.0   NaN      NaN         NaN  1.0

Or, closer to the original intention, we can concat horizontally just the Seg portion to the original:

df2 = pd.concat([
    df.set_index('ID'),
    z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}').fillna(0),
], axis=1)

>>> df2
         Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Admin  Seg_Customer  Seg_Maintenance  Seg_Shop
ID                                                                                                          
1   Maintenance       4     Admin     4.0      0     0.0       0.50          0.00              0.5       0.0
2          Shop       4  Customer     2.0  Admin     2.0       0.25          0.25              0.0       0.5
3         Admin       8         0     0.0      0     0.0       1.00          0.00              0.0       0.0
4          Shop       8         0     0.0      0     0.0       0.00          0.00              0.0       1.0
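
One last detail: unstack emits the Seg_ columns in alphabetical order. If the segment_list order from the question is preferred, a plain column reindex fixes it (a small follow-up sketch, reusing segment_list as defined in the question):

seg_cols = ['Seg_' + s for s in segment_list]
df2 = df2[[c for c in df2.columns if not c.startswith('Seg_')] + seg_cols]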
Answered By: Pierre D