How to unstack a dataset to a certain dataframe?

Question:

I have a dataset like this

data = {'weight': ['NaN',2,3,4,'NaN',6,7,8,9,'NaN',11,12,13,14,15],
        'MI': ['NaN', 21, 19, 18, 'NaN',16,15,14,13,'NaN',11,10,9,8,7]}
df = pd.DataFrame(data, index= ['group1', "gene1", "gene2", 'gene3',
                      'group2', "gene1", 'gene21', 'gene4', 'gene7', 'group3',
                      'gene2', 'gene10', 'gene3', 'gene43', 'gene1'])

enter image description here

I need to stack it to gene by group dataframe with MI values. If there are no gene values for particular group the imputated value should be 0.1. ‘weights’ column should be removed. The final dataframe should look like this

enter image description here

Asked By: Yulia Kentieva

||

Answers:

You can use:

m = df['weight'].ne('NaN')

(df[m]
   .set_index((~m).cumsum()[m], append=True)['MI']
   .unstack('weight', fill_value=0.1)
   .add_prefix('group')
 )

Variant with pivot:

m = df['weight'].ne('NaN')

(df.assign(col=(~m).cumsum())
   .loc[m]
   .pivot(columns='col', values='MI')
   .fillna(0.1)
   .add_prefix('group')
 )

Output:

weight group1 group2 group3
gene1      21     16      7
gene10    0.1    0.1     10
gene2      19    0.1     11
gene21    0.1     15    0.1
gene3      18    0.1      9
gene4     0.1     14    0.1
gene43    0.1    0.1      8
gene7     0.1     13    0.1
Answered By: mozway

Another option, still with pivot:

(df
.assign(
     grp = np.where(df.weight.eq('NaN'), df.index, np.nan), 
     group = lambda df: df.grp.ffill())
.loc[lambda df: df.weight.ne('NaN'), ['MI', 'group']]
.pivot(index=None, columns='group', values='MI')
.fillna(0.1)
)

group   group1  group2  group3
gene1     21.0    16.0     7.0
gene10     0.1     0.1    10.0
gene2     19.0     0.1    11.0
gene21     0.1    15.0     0.1
gene3     18.0     0.1     9.0
gene4      0.1    14.0     0.1
gene43     0.1     0.1     8.0
gene7      0.1    13.0     0.1

Another process, for the fun of it, and possibly an inkling into the complexities that pivot abstracts:

# get positions for group1,2,3:
positions = df.weight.eq('NaN').to_numpy().nonzero()[0]
groups = df.index[positions]

#get lengths of rows between group1,2,3:
positions = np.diff(np.append(positions, len(df))) - 1

# compute the cumulative positions and pass it to np.split:
arrays = np.split(df.MI.loc[df.MI.ne('NaN')], np.cumsum(positions))

#concatenate the arrays, get the final output:
pd.concat(arrays, axis = 1, keys = groups).fillna(0.1)

        group1  group2  group3
gene1     21.0    16.0     7.0
gene2     19.0     0.1    11.0
gene3     18.0     0.1     9.0
gene21     0.1    15.0     0.1
gene4      0.1    14.0     0.1
gene7      0.1    13.0     0.1
gene10     0.1     0.1    10.0
gene43     0.1     0.1     8.0
Answered By: sammywemmy
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.