Repeatedly removing the maximum average subarray

Question:

I have an array of positive integers. For example:

[1, 7, 8, 4, 2, 1, 4]

A "reduction operation" finds the array prefix with the highest average, and deletes it. Here, an array prefix means a contiguous subarray whose left end is the start of the array, such as [1] or [1, 7] or [1, 7, 8] above. Ties are broken by taking the longer prefix.

Original array:  [  1,   7,   8,   4,   2,   1,   4]

Prefix averages: [1.0, 4.0, 5.3, 5.0, 4.4, 3.8, 3.9]

-> Delete [1, 7, 8], with maximum average 5.3
-> New array -> [4, 2, 1, 4]

I will repeat the reduction operation until the array is empty:

[1, 7, 8, 4, 2, 1, 4]
 ^     ^
[4, 2, 1, 4]
 ^
[2, 1, 4]
 ^     ^
[]

Now, actually performing these array modifications isn’t necessary; I’m only looking for the list of lengths of prefixes that would be deleted by this process, for example, [3, 1, 3] above.

What is an efficient algorithm for computing these prefix lengths?


The naive approach is to recompute all sums and averages from scratch in every iteration, giving an O(n^2) algorithm; I've attached Python code for this below. I'm looking for any improvement on this approach, ideally any solution below O(n^2), but an algorithm with the same complexity and better constant factors would also be helpful.

Here are a few of the things I’ve tried (without success):

  1. Dynamically maintaining prefix sums, for example with a Binary Indexed Tree (see the sketch after this list). While I can update a prefix sum or find a maximum prefix sum in O(log n) time, I haven't found any data structure that can maintain the maximum average, because the denominator of the average changes with each query.
  2. Reusing the previous 'rankings' of prefix averages. These rankings can change: in some array, the prefix ending at index 5 may have a larger average than the prefix ending at index 6, but after removing the first 3 elements, the prefix now ending at index 2 may have a smaller average than the one now ending at index 3.
  3. Looking for patterns in where prefixes end; for example, the rightmost element of any max average prefix is always a local maximum in the array, but it’s not clear how much this helps.
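
A minimal sketch of the Binary Indexed Tree idea from point 1 (illustrative code only; the Fenwick class below is hypothetical and not part of the question). It maintains prefix sums in O(log n) per update or query, but the quantity that has to be maximized after removing the first k elements, (prefix_sum(j) - prefix_sum(k)) / (j - k), has a denominator that depends on the query, which is exactly the part this structure does not help with:

class Fenwick:
    """Fenwick / Binary Indexed Tree over n values."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, i: int, delta: int) -> None:
        """Add delta to element i (0-indexed)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i: int) -> int:
        """Return the sum of elements [0, i)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

# Prefix sums are easy to maintain, but the value to maximize after removing
# the first k elements is (prefix_sum(j) - prefix_sum(k)) / (j - k), and the
# denominator j - k changes with every query, so a precomputed maximum cannot
# be reused the way a maximum prefix sum could be.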

This is a working Python implementation of the naive, quadratic method:

import math
from fractions import Fraction
from typing import List, Tuple

def find_array_reductions(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr."""
        if len(arr) == 0:
            return (-math.inf, 0)

        best_length = 1
        best_average = Fraction(0, 1)
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = Fraction(running_sum, i)
            if new_average >= best_average:
                best_average = new_average
                best_length = i

        return (float(best_average), best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths
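
On the example array, this returns the expected list of prefix lengths:

print(find_array_reductions([1, 7, 8, 4, 2, 1, 4]))  # [3, 1, 3]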

Edit: The originally published code had a rare error with large inputs from using Python’s math.isclose() with default parameters for floating point comparison, rather than proper fraction comparison. This has been fixed in the current code. An example of the error can be found at this Try it online link, along with a foreword explaining exactly what causes this bug, if you’re curious.

Asked By: kcsquared

Answers:

This problem has a fun O(n) solution.

If you draw a graph of cumulative sum vs index, then:

The average value in the subarray between any two indexes is the slope of the line between those points on the graph.

The first highest-average prefix will end at the point that makes the highest angle (steepest slope) from the origin. The next highest-average prefix must then have a smaller average, and it will end at the point that makes the highest angle from the first prefix's endpoint. Continuing to the end of the array, we find that…

These segments of highest average are exactly the segments in the upper convex hull of the cumulative sum graph.
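
As a quick sanity check of this correspondence (an illustrative snippet, not from the original answer): the cumulative-sum points of the example array are (0, 0), (1, 1), (2, 8), (3, 16), (4, 20), (5, 22), (6, 23), (7, 27), the slope from the origin to point i equals the average of the first i elements, and the steepest such slope, 16/3, occurs at i = 3, matching the first reduction of length 3.

from itertools import accumulate
from fractions import Fraction

arr = [1, 7, 8, 4, 2, 1, 4]
points = [(0, 0)] + list(enumerate(accumulate(arr), 1))  # cumulative sum graph
slopes = [Fraction(y, x) for x, y in points[1:]]         # slope from the origin
# slopes == prefix averages: 1, 4, 16/3, 5, 22/5, 23/6, 27/7
best = max(range(len(slopes)), key=lambda i: (slopes[i], i))  # ties -> longer prefix
assert best == 2  # zero-based index 2, i.e. prefix length 3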

Find these segments using the monotone chain convex hull algorithm. Since the points are already sorted by x coordinate, it takes O(n) time.

# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def upperSumHullLengths(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]
    
    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)
    
    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]


print(upperSumHullLengths([  1,   7,   8,   4,   2,   1,   4]))

prints:

[3, 1, 3]
Answered By: Matt Timmermans

Somewhat simplified versions of Matt’s and kcsquared’s solutions and some benchmarks:

from itertools import accumulate, pairwise

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]
from itertools import accumulate, count
from operator import truediv

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals
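
As a quick check (an added illustration, not part of the original benchmark), both simplified versions reproduce the example result:

print(Matt_Pychoed([1, 7, 8, 4, 2, 1, 4]))  # [3, 1, 3]
print(kc_Pychoed_2([1, 7, 8, 4, 2, 1, 4]))  # [3, 1, 3]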

Benchmark with twenty different arrays of 100,000 random integers from 1 to 1000:

  min   median   mean     max  
 65 ms  164 ms  159 ms  249 ms  kc
 38 ms   98 ms   92 ms  146 ms  kc_Pychoed_1
 58 ms  127 ms  120 ms  189 ms  kc_Pychoed_2
134 ms  137 ms  138 ms  157 ms  Matt
101 ms  102 ms  103 ms  111 ms  Matt_Pychoed

Here kc_Pychoed_1 is kcsquared's solution but with an integer running_sum and without math.isclose. I verified that all solutions compute the same result for every input.

For such random data, kcsquared's solution appears to be between O(n) and O(n log n). But it degrades to quadratic if the array is strictly decreasing: every maximum-average prefix then has length 1, so each of the n removals rescans the whole remaining array. For arr = [1000, 999, 998, ..., 2, 1] I got:

  min   median   mean     max  
102 ms  106 ms  107 ms  116 ms  kc
 60 ms   61 ms   61 ms   62 ms  kc_Pychoed_1
 76 ms   77 ms   77 ms   86 ms  kc_Pychoed_2
  0 ms    1 ms    1 ms    1 ms  Matt
  0 ms    0 ms    0 ms    0 ms  Matt_Pychoed

Benchmark code (Try it online!):

from timeit import default_timer as timer
from statistics import mean, median
import random
from typing import List, Tuple
import math
from itertools import accumulate, count
from operator import truediv

def kc(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)
        
        best_length = 1
        best_average = -math.inf
        running_sum = 0.0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i
            
            if (new_average >= best_average
                or math.isclose(new_average, best_average)):
                
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_1(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)
        
        best_length = 1
        best_average = -math.inf
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i
            
            if new_average >= best_average:
                
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals

# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def Matt(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]
    
    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)
    
    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]

def pairwise(lst):
    return zip(lst, lst[1:])

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]

funcs = kc, kc_Pychoed_1, kc_Pychoed_2, Matt, Matt_Pychoed
stats = min, median, mean, max
tss = [[] for _ in funcs]
for r in range(1, 21):
    print(f'After round {r}:')
    arr = random.choices(range(1, 1001), k=100_000)
    # arr = list(range(1000, 1, -1))
    expect = None
    print(*(f'{stat.__name__:^7}' for stat in stats))
    for func, ts in zip(funcs, tss):
        t0 = timer()
        result = func(arr)
        t1 = timer()
        ts.append(t1 - t0)
        if expect is None:
            expect = result
        assert result == expect
        print(*('%3d ms ' % (stat(ts) * 1e3) for stat in stats), func.__name__)
    print()
Answered By: Pychopath