Vectorized column creation, where each value is pulled from a different column in a multi-index pandas df

Question

I have a multiindex dataframe like the one below:

import numpy as np
import pandas as pd


np.random.seed(1)
df = pd.DataFrame(
    {
        'trial':  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        't':      [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1],
        'context':[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
        'x' :np.random.rand(40),
        'y' : np.random.rand(40),
        'z' : np.random.rand(40),
        'inferred_context':   [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
    }
)

df = df.set_index(["trial", "t","context"])
df= df.unstack('context')
df.columns.set_names(['vals','context'],inplace=True)
df.swaplevel(0,1)

So it has a two-level row (trial,t) and column (vals, context) index.

I want to create a column called "x_inferred_context" which contains the values of "x" for the context specified in "inferred_context" for each trial and time point t.

For example if at row (trial = 0, t = 3), "inferred_context"=0 and "x" at (val = x, context=0) is 0.204452, the "x_inferred_context" column should contain 0.204452 at row (trial=0, t=3). Corresponidngly if at row (trial=1,t=0) the "inferred_context" is 2 and the value of "x" for context 2 is 0.140387, the column "x_inferred_context" at (1,0) should contain 0.140387.

I have achieved this by creating a mask for each possible context and then multiplying the masks by df[‘x’] and summing.

nc = np.unique((df.columns.get_level_values(level=1))).size
mask = pd.DataFrame( data = [(df['inferred_context'] == c).iloc[:,c] for c in range(nc)]).T
df['x_inferred_context'] = (mask*df['x']).sum(axis=1)

I am new to pandas and want to ask what is the correct way to do this? Is there a less verbose, easier to read and more pandas-like way to do this?

Cheers!

Asked By: terraregina

||

Source

Answer 1

Use a double indexing lookup:

idx, col = pd.factorize(df.index.get_level_values('t'))

idx2 = df['inferred_context'].reindex(col, axis=1).to_numpy()[np.arange(len(df)), idx]

df[('x_inferred_context', None)] = df['x'].reindex(col, axis=1).to_numpy()[np.arange(len(df)), idx2]

Output:

vals            x                                       y                      
context       0.0       1.0       2.0       3.0       0.0       1.0       2.0   
trial t                                                                         
0     0  0.417022  0.720324  0.000114  0.302333  0.988861  0.748166  0.280444   
      1  0.146756  0.092339  0.186260  0.345561  0.103226  0.447894  0.908596   
      2  0.396767  0.538817  0.419195  0.685220  0.287775  0.130029  0.019367   
      3  0.204452  0.878117  0.027388  0.670468  0.211628  0.265547  0.491573   
1     0  0.417305  0.558690  0.140387  0.198101  0.574118  0.146729  0.589306   
      1  0.800745  0.968262  0.313424  0.692323  0.102334  0.414056  0.694400   
      2  0.876389  0.894607  0.085044  0.039055  0.049953  0.535896  0.663795   
      3  0.169830  0.878143  0.098347  0.421108  0.944595  0.586555  0.903402   
2     0  0.957890  0.533165  0.691877  0.315516  0.139276  0.807391  0.397677   
      1  0.686501  0.834626  0.018288  0.750144  0.927509  0.347766  0.750812   

vals                      z                               inferred_context  
context       3.0       0.0       1.0       2.0       3.0              0.0   
trial t                                                                      
0     0  0.789279  0.883306  0.623672  0.750942  0.348898                0   
      1  0.293614  0.269928  0.895886  0.428091  0.964840                0   
      2  0.678836  0.663441  0.621696  0.114746  0.949489                0   
      3  0.053363  0.449912  0.578390  0.408137  0.237027                0   
1     0  0.699758  0.903380  0.573679  0.002870  0.617145                2   
      1  0.414179  0.326645  0.527058  0.885942  0.357270                2   
      2  0.514889  0.908535  0.623360  0.015821  0.929437                2   
      3  0.137475  0.690897  0.997323  0.172341  0.137136                2   
2     0  0.165354  0.932595  0.696818  0.066000  0.755463                1   
      1  0.725998  0.753876  0.923025  0.711525  0.124271                1   

vals                x_inferred_context  
context 1.0 2.0 3.0                NaN  
trial t                                 
0     0   0   0   0           0.417022  
      1   0   0   0           0.146756  
      2   0   0   0           0.396767  
      3   0   0   0           0.204452  
1     0   2   2   2           0.140387  
      1   2   2   2           0.313424  
      2   2   2   2           0.085044  
      3   2   2   2           0.098347  
2     0   1   1   1           0.533165  
      1   1   1   1           0.834626

Answered By: mozway

Vectorized column creation, where each value is pulled from a different column in a multi-index pandas df

Question:

Answers: