Repeatedly removing the maximum average subarray
Question:
I have an array of positive integers. For example:
[1, 7, 8, 4, 2, 1, 4]
A "reduction operation" finds the array prefix with the highest average, and deletes it. Here, an array prefix means a contiguous subarray whose left end is the start of the array, such as [1] or [1, 7] or [1, 7, 8] above. Ties are broken by taking the longer prefix.
Original array: [ 1, 7, 8, 4, 2, 1, 4]
Prefix averages: [1.0, 4.0, 5.3, 5.0, 4.4, 3.8, 3.9]
-> Delete [1, 7, 8], with maximum average 5.3
-> New array -> [4, 2, 1, 4]
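A single reduction step can be sketched as follows. This is only an illustration of the definition above, not an efficient method; the helper name best_prefix_length is made up for this example, and exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction
from itertools import accumulate

def best_prefix_length(arr):
    """Length of the max-average prefix; ties go to the longer prefix."""
    best_len, best_avg = 0, None
    for length, total in enumerate(accumulate(arr), 1):
        avg = Fraction(total, length)  # exact average of arr[:length]
        # ">=" rather than ">" so that a later, equal average wins the tie
        if best_avg is None or avg >= best_avg:
            best_len, best_avg = length, avg
    return best_len

print(best_prefix_length([1, 7, 8, 4, 2, 1, 4]))  # 3
```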
I will repeat the reduction operation until the array is empty:
[1, 7, 8, 4, 2, 1, 4]
^ ^
[4, 2, 1, 4]
^ ^
[2, 1, 4]
^ ^
[]
Now, actually performing these array modifications isn’t necessary; I’m only looking for the list of lengths of prefixes that would be deleted by this process, for example, [3, 1, 3] above.
What is an efficient algorithm for computing these prefix lengths?
The naive approach is to recompute all sums and averages from scratch in every iteration for an O(n^2) algorithm; I’ve attached Python code for this below. I’m looking for any improvement on this approach: ideally a solution below O(n^2), but an algorithm with the same complexity and better constant factors would also be helpful.
Here are a few of the things I’ve tried (without success):
- Dynamically maintaining prefix sums, for example with a Binary Indexed Tree. While I can easily update prefix sums or find a maximum prefix sum in O(log n) time, I haven’t found any data structure which can update the average, as the denominator in the average is changing.
- Reusing the previous ‘rankings’ of prefix averages: these rankings can change, e.g. in some array, the prefix ending at index 5 may have a larger average than the prefix ending at index 6, but after removing the first 3 elements, now the prefix ending at index 2 may have a smaller average than the one ending at 3.
- Looking for patterns in where prefixes end; for example, the rightmost element of any max average prefix is always a local maximum in the array, but it’s not clear how much this helps.
This is a working Python implementation of the naive, quadratic method:
import math
from fractions import Fraction
from typing import List, Tuple

def find_array_reductions(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr."""
        if len(arr) == 0:
            return (-math.inf, 0)

        best_length = 1
        best_average = Fraction(0, 1)
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            # Exact rational average, so ties are compared without rounding
            new_average = Fraction(running_sum, i)
            if new_average >= best_average:
                best_average = new_average
                best_length = i

        return (float(best_average), best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths
Edit: The originally published code had a rare error with large inputs from using Python’s math.isclose() with default parameters for floating point comparison, rather than proper fraction comparison. This has been fixed in the current code. An example of the error can be found at this Try it online link, along with a foreword explaining exactly what causes this bug, if you’re curious.
Answers:
This problem has a fun O(n) solution.
If you draw a graph of cumulative sum vs index, then:
The average value in the subarray between any two indexes is the slope of the line between those points on the graph.
The first highest-average-prefix will end at the point that makes the highest angle from 0. The next highest-average-prefix must then have a smaller average, and it will end at the point that makes the highest angle from the first ending. Continuing to the end of the array, we find that…
These segments of highest average are exactly the segments in the upper convex hull of the cumulative sum graph.
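As a quick check of the slope claim on the question's example, here is a small sketch. The names points and slope are introduced only for this illustration, and accumulate(..., initial=0) assumes Python 3.8 or later.

```python
from fractions import Fraction
from itertools import accumulate

arr = [1, 7, 8, 4, 2, 1, 4]
# Cumulative-sum points (index, sum), starting from (0, 0).
points = list(enumerate(accumulate(arr, initial=0)))

def slope(p, q):
    """Exact slope of the line through two cumulative-sum points."""
    return Fraction(q[1] - p[1], q[0] - p[0])

# The slope between points 0 and 3 is the average of arr[0:3].
assert slope(points[0], points[3]) == Fraction(sum(arr[0:3]), 3)

# Slopes of the segments deleted in the question's example (lengths 3, 1, 3).
ends = [0, 3, 4, 7]
slopes = [slope(points[a], points[b]) for a, b in zip(ends, ends[1:])]
print(slopes)  # [Fraction(16, 3), Fraction(4, 1), Fraction(7, 3)]
```

The slopes come out strictly decreasing, which is what one expects from consecutive edges of an upper hull read left to right.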
Find these segments using the monotone chain algorithm. Since the points are already sorted, it takes O(n) time.
# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def upperSumHullLengths(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]

    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)

    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]

print(upperSumHullLengths([ 1, 7, 8, 4, 2, 1, 4]))

prints:

[3, 1, 3]
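The test dy1*dx0 < dy0*dx1 in the loop above compares two slopes without dividing, so it stays exact with integers. A minimal sketch of that idea, using a helper name (slope_less) that is not part of the answer's code and assuming both dx values are positive, as they are for points with increasing x:

```python
from fractions import Fraction

def slope_less(dy1, dx1, dy0, dx0):
    """Return True if dy1/dx1 < dy0/dx0, assuming dx0 > 0 and dx1 > 0."""
    # Multiplying both sides by dx0*dx1 (positive) keeps the inequality direction.
    return dy1 * dx0 < dy0 * dx1

# Same answer as an exact rational comparison, with no division involved.
print(slope_less(7, 3, 4, 1), Fraction(7, 3) < Fraction(4, 1))    # True True
print(slope_less(4, 1, 16, 3), Fraction(4, 1) < Fraction(16, 3))  # True True
```

Because the comparison is strict, equal slopes do not stop the popping, so collinear points are merged into one longer segment. That matches the question's rule that ties go to the longer prefix.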
Somewhat simplified versions of Matt’s and kcsquared’s solutions and some benchmarks:
from itertools import accumulate, pairwise

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]

from itertools import accumulate, count
from operator import truediv

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals
Benchmark with twenty different arrays of 100,000 random integers from 1 to 1000:
  min   median   mean    max
 65 ms  164 ms  159 ms  249 ms  kc
 38 ms   98 ms   92 ms  146 ms  kc_Pychoed_1
 58 ms  127 ms  120 ms  189 ms  kc_Pychoed_2
134 ms  137 ms  138 ms  157 ms  Matt
101 ms  102 ms  103 ms  111 ms  Matt_Pychoed
Where kc_Pychoed_1 is kcsquared’s but with integer running_sum and without math.isclose. And I verify that all solutions compute the same result for every input.
For such random data, kcsquared’s appears to be between O(n) and O(n log n). But it degrades to quadratic if the array is strictly decreasing. For arr = [1000, 999, 998, ..., 2, 1] I got:
  min   median   mean    max
102 ms  106 ms  107 ms  116 ms  kc
 60 ms   61 ms   61 ms   62 ms  kc_Pychoed_1
 76 ms   77 ms   77 ms   86 ms  kc_Pychoed_2
  0 ms    1 ms    1 ms    1 ms  Matt
  0 ms    0 ms    0 ms    0 ms  Matt_Pychoed
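To see where the quadratic behaviour comes from, one can count how many elements the rescanning approach visits. The sketch below is not one of the benchmarked functions; naive_scan_work is a hypothetical helper that mimics the rescanning loop with index arithmetic and integer cross-multiplication instead of slicing and division.

```python
def naive_scan_work(arr):
    """Count elements visited when every reduction rescans the remaining array."""
    work = 0
    start = 0
    n = len(arr)
    while start < n:
        best_len, best_sum = 1, arr[start]
        running = 0
        for length in range(1, n - start + 1):
            running += arr[start + length - 1]
            work += 1
            # running/length >= best_sum/best_len, compared without division
            if running * best_len >= best_sum * length:
                best_len, best_sum = length, running
        start += best_len
    return work

print(naive_scan_work(list(range(1000, 0, -1))))  # 500500
print(naive_scan_work(list(range(1, 1001))))      # 1000
```

On a strictly decreasing array every reduction removes a single element, so the scans have lengths n, n-1, ..., 1, which sums to n(n+1)/2. On a strictly increasing array the whole array goes in one scan.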
Benchmark code (Try it online!):
from timeit import default_timer as timer
from statistics import mean, median
import random
from typing import List, Tuple
import math
from itertools import accumulate, count
from operator import truediv

def kc(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)

        best_length = 1
        best_average = -math.inf
        running_sum = 0.0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i

            if (new_average >= best_average
                    or math.isclose(new_average, best_average)):
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_1(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)

        best_length = 1
        best_average = -math.inf
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i

            if new_average >= best_average:
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals

# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def Matt(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]

    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)

    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]

# Local stand-in for itertools.pairwise (only available from Python 3.10)
def pairwise(lst):
    return zip(lst, lst[1:])

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]

funcs = kc, kc_Pychoed_1, kc_Pychoed_2, Matt, Matt_Pychoed
stats = min, median, mean, max
tss = [[] for _ in funcs]
for r in range(1, 21):
    print(f'After round {r}:')
    arr = random.choices(range(1, 1001), k=100_000)
    # Strictly decreasing case [1000, 999, ..., 2, 1]:
    # arr = list(range(1000, 0, -1))
    expect = None
    print(*(f'{stat.__name__:^7}' for stat in stats))
    for func, ts in zip(funcs, tss):
        t0 = timer()
        result = func(arr)
        t1 = timer()
        ts.append(t1 - t0)
        if expect is None:
            expect = result
        assert result == expect
        print(*('%3d ms ' % (stat(ts) * 1e3) for stat in stats), func.__name__)
    print()