Groupby shift (lagged values) analogue with only Numpy (no pandas)
Question:
I have a dataframe that looks like this:
id date v1
0 0 1983.0 1.574
1 0 1984.0 1.806
2 0 1985.0 4.724
3 1 1986.0 0.320
4 1 1987.0 3.414
... ... ...
107191 9874 1993.0 52.448
107192 9874 1994.0 108.652
107193 9875 1992.0 1.597
107194 9875 1993.0 3.134
107195 9875 1994.0 7.619
I want to generate a new column that has lagged values of v1
ordered by id
. In pandas I would use
df.groupby('id')['v1'].shift(-1)
However, I want to translate this in pure matrix/array form using only Numpy. What is the most straightforward way to get the analogue in Numpy? I need to avoid using pandas tools because I want to use the Numba @jit later on.
Answers:
IIUC, you are looking to implement df.groupby('id')['v1'].shift(-1)
purely in numpy. This is composed of a grouper and a shift method.
A groupby()
equivalent in numpy for 2D array with first grouping column and second value column is –
np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
And the shift()
equivalent in numpy for a 1D array is –
np.append(np.roll(arr,-1)[:-1], np.nan)
Putting these 2 together you can get what you want –
#2D array with only id and v1 as columns
arr = df[['id','v1']].values
#Groupby based on id
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
#apply shift to grouped elements
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
#stack them as a single array
new_col = np.hstack(shift)
#set as column
df['shifted'] = new_col
Test on dummy data –
#Dummy data
idx = [0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3]
val = np.arange(len(idx))
arr = np.array([idx, val]).T
df = pd.DataFrame(arr, columns=['id','v1'])
#apply grouped shifting
arr = df[['id','v1']].values
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
new_col = np.hstack(shift)
df['shifted'] = new_col
print(df)
id v1 shifted
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 0 3 4.0
4 0 4 NaN
5 1 5 6.0
6 1 6 7.0
7 1 7 8.0
8 1 8 9.0
9 1 9 10.0
10 1 10 NaN
11 2 11 12.0
12 2 12 13.0
13 2 13 14.0
14 2 14 NaN
15 3 15 16.0
16 3 16 17.0
17 3 17 18.0
18 3 18 19.0
19 3 19 NaN
I have a dataframe that looks like this:
id date v1
0 0 1983.0 1.574
1 0 1984.0 1.806
2 0 1985.0 4.724
3 1 1986.0 0.320
4 1 1987.0 3.414
... ... ...
107191 9874 1993.0 52.448
107192 9874 1994.0 108.652
107193 9875 1992.0 1.597
107194 9875 1993.0 3.134
107195 9875 1994.0 7.619
I want to generate a new column that has lagged values of v1
ordered by id
. In pandas I would use
df.groupby('id')['v1'].shift(-1)
However, I want to translate this in pure matrix/array form using only Numpy. What is the most straightforward way to get the analogue in Numpy? I need to avoid using pandas tools because I want to use the Numba @jit later on.
IIUC, you are looking to implement df.groupby('id')['v1'].shift(-1)
purely in numpy. This is composed of a grouper and a shift method.
A groupby()
equivalent in numpy for 2D array with first grouping column and second value column is –
np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
And the shift()
equivalent in numpy for a 1D array is –
np.append(np.roll(arr,-1)[:-1], np.nan)
Putting these 2 together you can get what you want –
#2D array with only id and v1 as columns
arr = df[['id','v1']].values
#Groupby based on id
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
#apply shift to grouped elements
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
#stack them as a single array
new_col = np.hstack(shift)
#set as column
df['shifted'] = new_col
Test on dummy data –
#Dummy data
idx = [0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3]
val = np.arange(len(idx))
arr = np.array([idx, val]).T
df = pd.DataFrame(arr, columns=['id','v1'])
#apply grouped shifting
arr = df[['id','v1']].values
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
new_col = np.hstack(shift)
df['shifted'] = new_col
print(df)
id v1 shifted
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 0 3 4.0
4 0 4 NaN
5 1 5 6.0
6 1 6 7.0
7 1 7 8.0
8 1 8 9.0
9 1 9 10.0
10 1 10 NaN
11 2 11 12.0
12 2 12 13.0
13 2 13 14.0
14 2 14 NaN
15 3 15 16.0
16 3 16 17.0
17 3 17 18.0
18 3 18 19.0
19 3 19 NaN