Pandas – for loop and find index that has the nearest value

Question:

I am basically trying to loop through a dataframe that has been grouped and, for each group, find the index whose value is nearest to the input argument.

For example, given the dataframe below, for every group defined by global_id I want to take frames that are spaced at least 10 frames apart. So given the list of frames [1,2,3,4,14,20,30,31], the output would be [1,14,30] because

  • I would initialize by taking frame 1 as the first frame
  • The next frame that is at least 10 frames apart from 1 is frame 14
  • The following frame that is at least 10 frames apart from 14 is frame 30 (a short plain-Python sketch of this rule follows the list)
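
To make the rule concrete, here is the selection logic written out in plain Python (this only spells out the rule I want; it is not the pandas solution I am after):

frames = [1, 2, 3, 4, 14, 20, 30, 31]
kept = [frames[0]]  # always keep the first frame of the group
for f in frames[1:]:
    if f - kept[-1] >= 10:  # keep a frame only if it is at least 10 frames after the last kept one
        kept.append(f)
print(kept)  # [1, 14, 30]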

As such, the resulting before and after dataframes should look like the following.

Before:

       seq_name     label pedestrian_id  frame_no  global_id
0          0001  crossing          0001  0001         1
1          0001  crossing          0001  0002         1
2          0001  crossing          0001  0003         1
3          0001  crossing          0001  0004         1
4          0001  crossing          0001  0005         1
5          0001  crossing          0001  0006         1
6          0001  crossing          0001  0007         1
7          0001  crossing          0001  0008         1
8          0001  crossing          0001  0009         1
9          0001  crossing          0001  0010         1
10         0001  crossing          0002  0001         2
11         0001  crossing          0002  0012         2
12         0001  crossing          0002  0013         2
13         0001  crossing          0002  0014         2
14         0001  crossing          0002  0015         2
15         0001  crossing          0002  0029         2
16         0001  crossing          0002  0030         2
17         0001  crossing          0002  0031         2
18         0001  crossing          0002  0032         2
19         0001  crossing          0002  0033         2
20         0002  crossing          0001  0034         3
21         0002  crossing          0001  0035         3
22         0002  crossing          0001  0036         3
23         0002  crossing          0001  0037         3
24         0002  crossing          0001  0038         3
25         0002  crossing          0001  0039         3
26         0002  crossing          0001  0049         3
27         0002  crossing          0001  0050         3
28         0002  crossing          0001  0051         3
29         0002  crossing          0001  0052         3

After filter:

       seq_name     label pedestrian_id  frame_no  global_id
0          0001  crossing          0001  0001         1
10         0001  crossing          0002  0001         2
11         0001  crossing          0002  0012         2
15         0001  crossing          0002  0029         2
25         0002  crossing          0001  0039         3
26         0002  crossing          0001  0049         3

Below is what I have. Once I have the indices I can create a new dataframe by indexing from the old (that last step is shown right after the code). I am still new to Pandas and this looks extremely cumbersome, so I am hoping there is a more elegant solution.

I have read through the docs on groupby and some other SO posts but still can’t figure it out. I am just trying to clean up my data processing pipeline by replacing everything with Pandas.

import numpy as np
import pandas as pd

ind = []
for j in df["global_id"].unique():
    # frame numbers belonging to this global_id
    df_temp = df[df["global_id"] == j][["frame_no"]]
    df_temp["frame_no"] = pd.to_numeric(df_temp["frame_no"])
    start_frame = df_temp["frame_no"].min()
    end_frame = df_temp["frame_no"].max()
    i = start_frame - 1
    # walk over the group in windows of 10 frames and keep the first row of each window
    while i < end_frame:
        window = df_temp[(df_temp["frame_no"] > i) & (df_temp["frame_no"] < i + 10)].index.tolist()
        if window:  # a window can be empty when there is a gap of more than 10 frames
            ind.append(np.min(window))
        i += 10
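
The last step I have in mind, once ind is filled, would simply be to index back into the original dataframe (assuming ind ends up holding row labels of df):

df_filtered = df.loc[ind]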
Asked By: Kong


Answers:

Here is one way using groupby, but first you need to define a function that does what you are looking for within each group. To explain the idea, let's consider a simple dataframe dfs = pd.DataFrame({'a':[1,2,3,4,14,20,30,31]}).

I have been looking to solve this kind of problem for a while, trying to avoid looping, and it seems complex. Here is the idea I ended up with. In numpy, you can use subtract combined with outer (np.subtract.outer) to get all the pairwise differences between elements:

print (np.subtract.outer(dfs.a, dfs.a))
array([[  0,  -1,  -2,  -3, -13, -19, -29, -30],
       [  1,   0,  -1,  -2, -12, -18, -28, -29],
       [  2,   1,   0,  -1, -11, -17, -27, -28],
       [  3,   2,   1,   0, -10, -16, -26, -27],
       [ 13,  12,  11,  10,   0,  -6, -16, -17],
       [ 19,  18,  17,  16,   6,   0, -10, -11],
       [ 29,  28,  27,  26,  16,  10,   0,  -1],
       [ 30,  29,  28,  27,  17,  11,   1,   0]], dtype=int64)

Now, for example, in column 0 you can see that differences > 10 start at row 4; going to column 4, differences > 10 start at row 6; and in column 6 you don't get a difference big enough. So the filtering would keep rows 0, 4 and 6, which are the values [1, 14, 30] we are looking for. To get these row numbers, you can compare the np.subtract.outer result to 10 and sum over axis=0, such as:

arr = (np.subtract.outer(dfs.a, dfs.a) <=10).sum(0)
print (arr)
array([4, 4, 4, 5, 6, 7, 8, 8])

Now you see that arr[0] = 4, then arr[4] = 6, then arr[6] = 8, which is out of bounds in this example, so we stop. One way to collect these numbers is with a while loop (if someone has a pure numpy solution for this step, I'm interested):

list_ind = [0] # initialize list of index to keep with 0
arr = (np.subtract.outer(dfs.a, dfs.a) <=10).sum(0)
i = arr[0]
while i < len(arr):
    list_ind.append(i)
    i = arr[i]

print (list_ind)
[0, 4, 6]

print (dfs.iloc[list_ind])
    a
0   1
4  14
6  30

Now with the whole problem and groupby, you can do:

# it seems you need to convert the column frame_no to integer
df['frame_int'] = pd.to_numeric(df['frame_no'])
df = df.sort_values('frame_int')  # ensure the data is sorted by frame_int, regardless of global_id

# define the function finding the indices to keep within each group
def find_ind(df_g):
    list_ind = [0]
    arr = (np.subtract.outer(df_g.frame_int, df_g.frame_int) <= 10).sum(0)
    i = arr[0]
    while i < len(arr):
        list_ind.append(i)
        i = arr[i]
    return df_g.iloc[list_ind]

#create the filtered dataframe
df_filtered = (df.groupby('global_id').apply(find_ind)
                 .drop('frame_int',axis=1).reset_index(drop=True))

print (df_filtered)
   seq_name     label  pedestrian_id  frame_no  global_id
0         1  crossing              1         1          1
1         1  crossing              2         1          2
2         1  crossing              2        12          2
3         1  crossing              2        29          2
4         2  crossing              1        34          3
5         2  crossing              1        49          3

If you want to keep the original rows' index, you can add level=0 in reset_index, i.e. use reset_index(level=0, drop=True) instead.
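
For example, with the same df and find_ind as above, that variant would look like:

df_filtered = (df.groupby('global_id').apply(find_ind)
                 .drop('frame_int', axis=1)
                 .reset_index(level=0, drop=True))  # keeps the original row labels from df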

Answered By: Ben.T