Groupby shift (lagged values) analogue with only Numpy (no pandas)

Question

I have a dataframe that looks like this:

          id    date       v1
0          0  1983.0    1.574
1          0  1984.0    1.806
2          0  1985.0    4.724
3          1  1986.0    0.320
4          1  1987.0    3.414
     ...     ...      ...
107191  9874  1993.0   52.448
107192  9874  1994.0  108.652
107193  9875  1992.0    1.597
107194  9875  1993.0    3.134
107195  9875  1994.0    7.619

I want to generate a new column that has lagged values of v1 ordered by id. In pandas I would use

df.groupby('id')['v1'].shift(-1)

However, I want to translate this in pure matrix/array form using only Numpy. What is the most straightforward way to get the analogue in Numpy? I need to avoid using pandas tools because I want to use the Numba @jit later on.

Asked By: Matteo

||

Source

Answer 1

IIUC, you are looking to implement df.groupby('id')['v1'].shift(-1) purely in numpy. This is composed of a grouper and a shift method.

A groupby() equivalent in numpy for 2D array with first grouping column and second value column is –

np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])

And the shift() equivalent in numpy for a 1D array is –

np.append(np.roll(arr,-1)[:-1], np.nan)

Putting these 2 together you can get what you want –

#2D array with only id and v1 as columns
arr = df[['id','v1']].values   

#Groupby based on id
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:]) 

#apply shift to grouped elements
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper] 

#stack them as a single array
new_col = np.hstack(shift) 

#set as column
df['shifted'] = new_col

Test on dummy data –

#Dummy data
idx = [0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3]
val = np.arange(len(idx))
arr = np.array([idx, val]).T
df = pd.DataFrame(arr, columns=['id','v1'])

#apply grouped shifting
arr = df[['id','v1']].values
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
new_col = np.hstack(shift)
df['shifted'] = new_col

print(df)

    id  v1  shifted
0    0   0      1.0
1    0   1      2.0
2    0   2      3.0
3    0   3      4.0
4    0   4      NaN
5    1   5      6.0
6    1   6      7.0
7    1   7      8.0
8    1   8      9.0
9    1   9     10.0
10   1  10      NaN
11   2  11     12.0
12   2  12     13.0
13   2  13     14.0
14   2  14      NaN
15   3  15     16.0
16   3  16     17.0
17   3  17     18.0
18   3  18     19.0
19   3  19      NaN

Answered By: Akshay Sehgal

Groupby shift (lagged values) analogue with only Numpy (no pandas)

Question:

Answers:

Test on dummy data –