Polars – How to create dynamic rolling window for calculations

Question:

Consider the following dataframe of sensor readings laid out along a straight line:

Location Start (KM) Location End (KM) Readings
1 1.1 7
1.1 1.23 null
1.23 1.3 8
1.3 1.34 null
1.34 1.4 null
1.4 1.5 5
1.5 1.65 6

I am trying to create a rolling 150m lookahead window to calculate the percentage of non nulls within that window, expecting the results to be look a little like the below:

Location Start (KM) Location End (KM) Readings Rolling % Non Null Readings (%)
1 1.1 7 67
1.1 1.23 null 13
1.23 1.3 8 47
1.3 1.34 null 33
1.34 1.4 null 60
1.4 1.5 5 100
1.5 1.65 6 100

Alternatively, a centered window would also work (but the example above is for a look-ahead window)

It seems that the style of creating a dynamic rolling window for the above is somewhat supported via the Polars groupby_dynamic but that seems to only work for temporal values, whereas the values in my columns are floats since they represent spatial locations. The rolling_apply method also seems to provide some means to an end, however it creates a rolling window over a fixed number of rows, which doesn’t quite suit this use case as well, as the number of rows to include in the window will defer depending on certain conditions (In this case the length of a particular reading)

How should I go about performing the following rolling calculations? I am trying to avoid writing explicit loops to loop through each row and check multiple conditions, but I can’t seem to figure out how to do so with the inbuilt methods.

I have attached a simple working example below written in Pandas that does the job, but in a naive loop-heavy way and was hoping to implement a faster version in Polars using some variant of its dynamic_groupby or rolling function:

df=pd.DataFrame({'Location Start (KM)':[1,1.1,1.23,1.3,1.34,1.4,1.5],
                'Location End (KM)':[1.1,1.23,1.3,1.34,1.4,1.5,1.65],
                'Readings':[7,np.nan,8,np.nan,np.nan,5,6]
                }) #sample dataframe
#create manual for loop to simulate a rolling forward looking window

def calculate_perc_non_null(row,window_length):
    '''
    Naive function that calculates the percentage of non null readings within a forward rolling window of a dataframe. Takes in:
    Row: pandas series object representing each row of the dataframe
    window_length: float that describes the window length in KMs   
    '''
    window_start=row['Location Start (KM)']
    window_end=window_start+window_length #generate window endpoints
    eligible_readings=df.loc[(df['Location Start (KM)']>=window_start)&(df['Location Start (KM)']<=window_end)]#readings that fall within the specified window
    #calculate number of non-nulls
    nulls=eligible_readings.loc[eligible_readings['Readings'].isnull()].copy() #grab all null values
    nulls.loc[nulls['Location End (KM)']>window_end,'Location End (KM)']=window_end #truncate ends to ensure no values outside the window are taken in
    total_length_of_nulls=(nulls['Location End (KM)']-nulls['Location Start (KM)']).sum()
    non_null_perc=100*(1-total_length_of_nulls/window_length) #calculate percentage of non null as a percentage
    return non_null_perc
df['Rolling % Non Null Readings (%)']=df.apply(lambda x:calculate_perc_non_null(x,window_length=0.15),axis=1)

Which outputs:


 Location Start (KM)  Location End (KM)  Readings  
0                 1.00               1.10       7.0   
1                 1.10               1.23       NaN   
2                 1.23               1.30       8.0   
3                 1.30               1.34       NaN   
4                 1.34               1.40       NaN   
5                 1.40               1.50       5.0   
6                 1.50               1.65       6.0   

   Rolling % Non Null Readings (%)  
0                        66.666667  
1                        13.333333  
2                        46.666667  
3                        33.333333  
4                        60.000000  
5                       100.000000  
6                       100.000000  
Asked By: sjs

||

Answers:

I may not understand what you need, but let’s see if we can get the ball rolling.

Let’s start by using groupby_rolling and see if this gets us close to what is needed.

First, let’s convert the floats (kilometers) to integers (meters). That allows us to specify our period as an integer: 150i. We’ll then calculate a duration (loc_endloc_start) that we’ll use in our calculations.

Also, we’ll create a boolean (is_not_null) which will automatically be upcast to 0 or 1 in our calculations.

The basic idea is to allow groupby_rolling to calculate each window. We’ll use a dot product to calculate our weighted values.

Since each window will likely be more than 150 meters, we’ll need to back out some overage from the last line that represents the amount over 150 (before dividing by 150)

Here’s what this would look like. (Note: I’ve also asked Polars to include the values it sees in each window — as lists — so that the results can be more easily inspected).

window_size = 150
(
    df
    .with_columns([
        (pl.col('^loc.*$') * 1_000).cast(pl.Int64),
        pl.col('readings').is_not_null().alias('is_not_null'),
    ])
    .with_column(
        (pl.col('loc_end') - pl.col('loc_start')).alias('duration'),
    )
    .groupby_rolling(
        index_column='loc_start',
        period=str(window_size) + "i",
        offset="0i",
        closed="left",
    )
    .agg([
        (
            (
                pl.col('duration').dot('is_not_null') -
                (pl.sum('duration') - window_size) * pl.col('is_not_null').last()
            ) / window_size
        ).alias('result'),
        pl.all().suffix('_val_list'),
    ])
)
shape: (7, 7)
┌───────────┬──────────┬────────────────────┬────────────────────┬───────────────────┬──────────────────────┬───────────────────┐
│ loc_start ┆ result   ┆ loc_start_val_list ┆ loc_end_val_list   ┆ readings_val_list ┆ is_not_null_val_list ┆ duration_val_list │
│ ---       ┆ ---      ┆ ---                ┆ ---                ┆ ---               ┆ ---                  ┆ ---               │
│ i64       ┆ f64      ┆ list[i64]          ┆ list[i64]          ┆ list[i64]         ┆ list[bool]           ┆ list[i64]         │
╞═══════════╪══════════╪════════════════════╪════════════════════╪═══════════════════╪══════════════════════╪═══════════════════╡
│ 1000      ┆ 0.666667 ┆ [1000, 1100]       ┆ [1100, 1230]       ┆ [7, null]         ┆ [true, false]        ┆ [100, 130]        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1100      ┆ 0.133333 ┆ [1100, 1230]       ┆ [1230, 1300]       ┆ [null, 8]         ┆ [false, true]        ┆ [130, 70]         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1230      ┆ 0.466667 ┆ [1230, 1300, 1340] ┆ [1300, 1340, 1400] ┆ [8, null, null]   ┆ [true, false, false] ┆ [70, 40, 60]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1300      ┆ 0.333333 ┆ [1300, 1340, 1400] ┆ [1340, 1400, 1500] ┆ [null, null, 5]   ┆ [false, false, true] ┆ [40, 60, 100]     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1340      ┆ 0.6      ┆ [1340, 1400]       ┆ [1400, 1500]       ┆ [null, 5]         ┆ [false, true]        ┆ [60, 100]         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1400      ┆ 1.0      ┆ [1400, 1500]       ┆ [1500, 1650]       ┆ [5, 6]            ┆ [true, true]         ┆ [100, 150]        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1500      ┆ 1.0      ┆ [1500]             ┆ [1650]             ┆ [6]               ┆ [true]               ┆ [150]             │
└───────────┴──────────┴────────────────────┴────────────────────┴───────────────────┴──────────────────────┴───────────────────┘

Note: the third value in the result column (0.46667) differs from the result in your posting (0.73).

Does this get the ball rolling?

Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.