Conditionally slice a pandas dataframe between a start and a stop value, keeping only monotonically increasing ranges

Question:

I am trying to slice (extract a subset from) a huge dataframe based on a start and a stop value of a speed column. This is not a simple "between" filter on the values: the extracted range also has to be monotonically increasing.

I am looking for the range to be strictly bounded by the start and stop values. The extracted range has to start with a start value and end with the stop value, and has to be monotonically increasing.

Here is an example:

import pandas as pd 
import numpy as np

speed = [0,1,2,3,4,5,6,7,8,9,0,11,12,13,14,15,14,13,11,12,15,12,14,16,11,10,9,5,12,20,21,22,25,27,32,34,35,30,20]
df = pd.DataFrame(speed, columns=['speed'])  # columns must be list-like, not a bare string

I only want to extract the slice where the speed starts, for example, at 20 and stops at 35. I don’t care what happens afterwards. All I care about is extracting this slice with a set start and end value, and it has to be monotonically increasing.

The typical result for the example would look like this :

Result

20
21
22
25
27
32
34
35
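
To illustrate why a plain value filter is not enough here, the following minimal sketch applies between() to the example list above. The name naive is only an illustrative variable, not part of the original post. between() looks at values only, so it also keeps the rows after the peak where the speed falls back into the 20 to 35 band:

```python
import pandas as pd

speed = [0,1,2,3,4,5,6,7,8,9,0,11,12,13,14,15,14,13,11,12,15,12,14,16,
         11,10,9,5,12,20,21,22,25,27,32,34,35,30,20]
df = pd.DataFrame(speed, columns=['speed'])

# between() is inclusive on both ends and ignores the order of the rows
naive = df.loc[df['speed'].between(20, 35), 'speed']

print(naive.tolist())                  # [20, 21, 22, 25, 27, 32, 34, 35, 30, 20]
print(naive.is_monotonic_increasing)   # False: the trailing 30 and 20 break the ramp
```

The trailing 30 and 20 are exactly the rows that the wanted result must exclude.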

I found this answer with the XOR ^ operator and a cummax, but adding a monotonicity condition like is_monotonic_increasing or diff() >= 0 messes things up, and the reset_index with loc idea mentioned in the same post does not seem to work either.

It seems trivial to do it with pandas, but could not find a sufficient answer.

Thanks in advance

Asked By: SomeOne

Answers:

For those of you who are looking for the same thing, here is how I got it. The basic idea is to extract the locations (indexes) of the start and stop values in the dataframe and then slice those ranges with pandas’s loc, using some loops and conditions. Let’s elaborate…

Take the example of someone speeding up from 50 km/h to 100 km/h without slowing down. We first generate two speed ranges that satisfy that condition and bury them in a larger list, in order to find them later. These two lists use different speed steps, one increasing by one and the other by five:

import pandas as pd 
import numpy as np
import random

speed_1 = []
speed_2 = []

for i in range(51): 
    speed_1.append(i+50)

for i in range (11): 
    speed_2.append(i*5 +50)

filler_1 = random.sample(range(0, 1000), 1000) # random lists of numbers
filler_2 = random.sample(range(0, 1000), 1000)

Speed = filler_1 + speed_1 + filler_2 + speed_2
df = pd.DataFrame({'Speed': Speed})

Now that we have set up the example, we first have to find all the locations (indexes) of the start and stop values (the 50s and 100s) in the dataframe. We’ll use the index for that:

x = df.index[df['Speed']==50].tolist()
y = df.index[df['Speed']==100].tolist()
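
As a small illustration of what these two lists contain, here is the same lookup on a tiny hand-made dataframe. The name toy and its values are assumptions made up for this sketch, not data from the answer:

```python
import pandas as pd

# Hypothetical toy data: two ramps from 50 to 100 separated by a drop to 3
toy = pd.DataFrame({'Speed': [7, 50, 60, 100, 3, 50, 100]})

# Boolean masks select the index labels where the condition holds
x = toy.index[toy['Speed'] == 50].tolist()
y = toy.index[toy['Speed'] == 100].tolist()

print(x)  # [1, 5]
print(y)  # [3, 6]
```

Every start index can be paired with every later stop index, which gives the candidate pairs (1, 3), (1, 6) and (5, 6). The pair (1, 6) spans the drop from 100 to 3, so it has to be rejected by a monotonicity check.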

The x and y lists hold the indexes of the start and stop values in the speed column. We have to loop over all the possible (start, stop) pairs and keep the valid ranges. For that, we use a two-level for-loop with a filter for monotonically increasing values:

result = []
for i in range(len(x)):
    for j in range(len(y)): 
        if x[i]<y[j]:  # This to make sure we are going forward not backward
            # Make sure we only take the ranges that are monotonically increasing;
            # for strictly monotonically increasing, replace 'ge' with 'gt'.
            if df.loc[x[i]:y[j],'Speed'].diff().fillna(1).ge(0).all():
                result.append(df.loc[x[i]:y[j],'Speed'].tolist())
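
The monotonicity test in the loop can also be written with the Series attribute is_monotonic_increasing, which is true for non-decreasing values (equal neighbours are allowed). The sketch below wraps both variants in two hypothetical helpers, is_non_decreasing and is_strictly_increasing, whose names are assumptions made for illustration only:

```python
import pandas as pd

def is_non_decreasing(s: pd.Series) -> bool:
    # Same outcome as s.diff().fillna(1).ge(0).all() for data without NaN
    return bool(s.is_monotonic_increasing)

def is_strictly_increasing(s: pd.Series) -> bool:
    # Every step must be positive; the leading NaN from diff() is dropped
    return bool(s.diff().dropna().gt(0).all())

a = pd.Series([50, 55, 55, 60])   # plateau at 55
b = pd.Series([50, 55, 54, 60])   # dips from 55 to 54

print(is_non_decreasing(a), is_strictly_increasing(a))   # True False
print(is_non_decreasing(b), is_strictly_increasing(b))   # False False
```

Which variant is appropriate depends on whether a plateau (constant speed) should count as part of an acceleration range.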
                                                                       

At this point the problem should be solved, at least in theory, but it does not work like that in practice. The strict ==50 and ==100 tests used to extract the indexes reject ranges whose start and stop values deviate slightly from the targets, even though those ranges are legitimate (and most real ranges look like that). If, for example, a car accelerates from 49.9 km/h and jumps within a fraction of a second (depending on the sampling frequency) to 50.1 km/h, then continues to speed up to 100 km/h, we would not pick this range up because of the strict equality above. So we need to relax that a bit using pandas between. Let’s first add two such practical ranges to our dataframe, where the start and stop values are approximately, not exactly, at the targets:

speed_3 = []
speed_4 = []

for i in range (19): 
    speed_3.append(i*3 +49)
    
for i in range (76): 
    speed_4.append(i*0.7 + 48)

Speed = filler_1 + speed_1 + filler_2 + speed_2 + filler_1 + speed_3 + filler_2 + speed_4
df = pd.DataFrame({'Speed': Speed})

Now we extract the indexes of these approximately equal start and stop values:

x = df.index[df['Speed'].between(49,50)].tolist()
y = df.index[df['Speed'].between(99,100)].tolist()
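
To see what the relaxed lookup returns, here is a minimal sketch on a short hand-made series of float speeds. The values and the names s, starts and stops are assumptions chosen for illustration. By default between() includes both bounds:

```python
import pandas as pd

# Hypothetical float speeds around the 50 and 100 km/h targets
s = pd.Series([48.7, 49.0, 49.4, 50.0, 50.1, 99.0, 99.8, 100.0, 100.5])

# Both bounds are inclusive by default
starts = s.index[s.between(49, 50)].tolist()
stops = s.index[s.between(99, 100)].tolist()

print(starts)  # [1, 2, 3]
print(stops)   # [5, 6, 7]
```

A single acceleration now yields three candidate starts and three candidate stops, so nine nearly identical (start, stop) pairs instead of one.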

But this creates another problem: redundant, repeated ranges. We can still carry on with the same logic as above, extract everything, and filter afterwards, but that is not an optimal solution. To eliminate the repeated ranges (almost identical but slightly different, because between returns many more indexes than the restrictive ==) we use the standard deviation std() [if you know a better statistical indicator of range similarity, please let me know in the comments] to decide whether a selected range is redundant:


practical_result = []
stand_dev = 0
for i in range(len(x)):
    for j in range(len(y)): 
        if x[i]<y[j]: 
            if df.loc[x[i]:y[j],'Speed'].diff().fillna(1).ge(0).all(): 
                # Compare the previous and the current std; if they are
                # approximately equal, we don't take that range.
                if (df.loc[x[i]:y[j],'Speed'].std() - stand_dev <= -0.3) or (df.loc[x[i]:y[j],'Speed'].std() - stand_dev >= 0.3):
                    practical_result.append(df.loc[x[i]:y[j],'Speed'].tolist())
                stand_dev = df.loc[x[i]:y[j],'Speed'].std()

We have to be careful about which standard deviation threshold we choose to decide whether a range is redundant. We should find a sweet spot: not so low that redundant ranges are included, and not so high that non-redundant ranges are eliminated.

Here is the complete code :

import pandas as pd 
import numpy as np
import random

# 1.Setting up examples. 
speed_1 = []
speed_2 = []
speed_3 = []
speed_4 = []

## 1.1 Two exact ranges from 50 kmh to 100 kmh. 
for i in range(51): 
    speed_1.append(i+50)

for i in range (11): 
    speed_2.append(i*5 +50)

## 1.2 Two approximate ranges from 50 kmh to 100 kmh     
for i in range (19): 
    speed_3.append(i*3 +49)
    
for i in range (76): 
    speed_4.append(i*0.7 + 48)

## 1.3 Random ranges to mix it up with our above correct ranges. 
filler_1 = random.sample(range(0, 1000), 1000)
filler_2 = random.sample(range(0, 1000), 1000)

## 1.4 Creating the example dataframe
Speed = filler_1 + speed_1 + filler_2 + speed_2 + filler_1 + speed_3 + filler_2 + speed_4
    
df = pd.DataFrame({'Speed': Speed})

# 2. Extracting our ranges

## 2.1 Extracting all the possible  approximate start and stop points from the dataframe. 

x = df.index[df['Speed'].between(49,50)].tolist()
y = df.index[df['Speed'].between(99,100)].tolist()

## 2.2 Two-level nested loops to go over all the possible combinations of ranges,
## three conditional if statements to check: first x[i]<y[j] so the loc
## would work, second for monotonicity, and third for std() to eliminate
## redundant ranges.
   
result = []
stand_dev = 0
for i in range(len(x)):
    for j in range(len(y)): 
        if x[i]<y[j]: 
            if df.loc[x[i]:y[j],'Speed'].diff().fillna(1).ge(0).all(): 
                if (df.loc[x[i]:y[j],'Speed'].std() - stand_dev <= -0.3) or (df.loc[x[i]:y[j],'Speed'].std() - stand_dev >= 0.3): # You could use abs() here for a shorter line. 
                    result.append(df.loc[x[i]:y[j],'Speed'].tolist())
                    stand_dev = df.loc[x[i]:y[j],'Speed'].std()

Edit: There is no need to use the standard deviation to eliminate redundant ranges. A better solution is to check whether the left limiting index (x[i]) is actually situated after the previous right limiting index (previous y[j]). Here is the new conditional nested loop:

result = []
previous_y_index = -1  # -1 rather than 0, so a range that starts at index 0 is not skipped
for i in range(len(x)):
    for j in range(len(y)): 
        if x[i]<y[j]: 
            if df.loc[x[i]:y[j],'Speed'].diff().fillna(1).ge(0).all(): 
                if x[i] > previous_y_index:  
                    result.append(df.loc[x[i]:y[j],'Speed'].tolist())
                    previous_y_index = y[j]

I left the original answer with the standard deviation option unedited for reference.
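
As an alternative to testing every (start, stop) pair, the sketch below first splits the column into non-decreasing runs and then looks for a start and a stop inside each run. This is a different technique from the nested loops above, not the method of this answer. The function name increasing_slices and its parameters are assumptions made for illustration, and it keeps only the widest start-to-stop slice per run:

```python
import pandas as pd

def increasing_slices(s: pd.Series, start_lo, start_hi, stop_lo, stop_hi) -> list:
    """Hypothetical helper: at most one slice per non-decreasing run."""
    # A new run begins wherever the value drops; cumsum turns drops into run ids
    run_id = s.diff().lt(0).cumsum()
    out = []
    for _, run in s.groupby(run_id):
        starts = run.index[run.between(start_lo, start_hi)]
        stops = run.index[run.between(stop_lo, stop_hi)]
        # Keep the run only if a start value occurs before a stop value
        if len(starts) and len(stops) and starts[0] < stops[-1]:
            out.append(run.loc[starts[0]:stops[-1]].tolist())
    return out

speed = [0,1,2,3,4,5,6,7,8,9,0,11,12,13,14,15,14,13,11,12,15,12,14,16,
         11,10,9,5,12,20,21,22,25,27,32,34,35,30,20]
found = increasing_slices(pd.Series(speed), 20, 20, 35, 35)
print(found)  # [[20, 21, 22, 25, 27, 32, 34, 35]]
```

Because every run is non-decreasing by construction, no separate monotonicity test is needed, and each row is visited once instead of once per candidate pair.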

Answered By: SomeOne

Here’s a function that takes a list of speeds plus start and end values and returns all monotonically nondecreasing ranges of speeds from start to end. Because it finds all ranges, it returns a nested list.

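def my_function(speed: list, start: int, end: int) -> list:
    res, s = [], []
    for x in speed:
        if s and x < s[-1]:
            s = []                 # speed dropped: discard the current run
        if x == start and not s:
            s = [x]                # open a new run at the start value
        elif s and x == end:
            res.append(s + [x])    # run reached the end value: store it
            s = []                 # reset, so a stored run is never modified later
        elif s:
            s.append(x)            # still nondecreasing: extend the run
    return res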

result = my_function(speed, 20, 35)
result
# [[20, 21, 22, 25, 27, 32, 34, 35]]
Answered By: not a robot