Most efficient way to continuously find the median of a stream of numbers (in Python)?

Question:

I’m trying to solve a problem which reads as follows:

A queue of eager single-digits (your input) are waiting to enter an empty room.

I allow one digit (from the left) to enter the room each minute.

Each time a new digit enters the room I chalk up the median of all the digits currently in the room on the chalkboard. [The median is the middle digit when the digits are arranged in ascending order.] If there are two median numbers (i.e. two middles) then rather than using the average, I chalk up the lower one of the two.

I chalk new digits to the right of existing digits so my chalkboard number keeps getting longer.

What number ends up on your chalkboard once all the digits are in the room?

Consider the example input: 21423814127333

  • 2 (the leftmost) is allowed into the room where it is the only digit so I write 2 on the chalkboard.
  • 1 is then allowed into the room to join 2. The smaller one of these two is used as the median so I chalk up 1 to the right of 2 on the chalkboard (my number is now 21)
  • 4 now enters the room. The median of 1, 2 and 4 is 2 so I add 2 to my chalkboard (my number is now 212)
  • …this process continues until the final 3 enters the room … all the numbers are in the room now which, when sorted, are 1,1,1,2,2,2,3,3,3,3,4,4,7,8. There are two median digits but they are both 3 so I add 3 to my chalkboard and my final number is 21222222222233

My current solution:

num = input()
new = str(num[0])
whole = [num[0]]

for i in range(1, len(num)):
    whole.append(num[i])
    whole.sort()
    new += whole[i//2]

print(new)

The problem is that it takes too long – so it passes 6/10 (hidden) test cases but exceeds the time limit for the other 4. Any help would be greatly appreciated.

Asked By: Blueberry

Answers:

You are re-sorting the entire list each time a digit is added,
using key comparisons,
so the total cost is O(N * N log N) in the worst case;
that is, it is at least quadratic.

"single-digits (your input) are waiting to enter"

The key to this problem is the range limit on input.
We know that each input x is in this range:

0 <= x < 10

Use counters.
We can easily allocate ten of them.

Keep a running count of the total number of digits that have been
admitted to the room.
Each time you have to report a median, compute the cumulative sum
of the counters in digit order, stopping when you get
to half the total count.

max_val = 10
counter = {i: 0 for i in range(max_val)}
...
assert 0 <= input_val < max_val

counter[input_val] += 1

cum_sum = 0
for i in range(max_val):
    cum_sum += counter[i]
    ...

Since median is a robust statistic,
typically there will be some stability
in the median you report, e.g. "2, 1, 2, 2, 2, 2".
You can use that to speed the computation
still further, by incrementally computing the
cumulative sum.
It won’t change the big-Oh complexity, though,
as there’s a constant number of counters.
We’re still looking at O(N), since we have to
examine each of the N digits admitted to the room and then report the current median.
This does beat an approach that relies on bisecting an ordered list,
which pays O(log N) per search plus the cost of shifting elements on each insertion.
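
The counting idea above can be filled in as follows. This is only a minimal sketch, not necessarily the exact code the answer had in mind: the function name running_lower_medians and the variable target are made up for illustration, and it assumes the input is a string containing only the characters 0-9. The lower median of `total` values is taken to be the element of 1-based rank (total + 1) // 2, which matches the "chalk up the lower one of the two" rule from the question.

```python
def running_lower_medians(digits):
    """Yield the lower median after each digit of the string `digits` is admitted."""
    counts = [0] * 10  # counts[d] = how many times digit d has been admitted so far
    total = 0          # how many digits are in the room
    for ch in digits:
        counts[int(ch)] += 1
        total += 1
        target = (total + 1) // 2  # 1-based rank of the lower median
        cum_sum = 0
        for value in range(10):    # at most 10 steps, independent of N
            cum_sum += counts[value]
            if cum_sum >= target:
                yield value
                break


print("".join(map(str, running_lower_medians("21423814127333"))))
# 21222222222233
```

The inner loop never runs more than ten times, so the work per admitted digit is bounded by a constant and the whole pass is linear in the length of the input.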

Answered By: J_H

Since whole is already sorted, you can use bisect.insort to insert each new item while keeping the list sorted:

from bisect import insort
num = input()
new = str(num[0])
whole = [num[0]]

for i in range(1, len(num)):
    insort(whole, num[i])
    new += whole[i//2]

print(new)

Answered By: Chris Wesseling

Maintain a list of cumulative counts of each digit and compute the median by finding the first position that corresponds to at least half the number of digits added so far:

def runMed(S):
    cum = [0]*10
    for i,digit in enumerate(map(int,S),1):
        cum[digit:] = (c+1 for c in cum[digit:])
        yield next(m for m,c in enumerate(cum) if c*2>=i)

output:

S = "21423814127333"
print(*runMed(S))
# 2 1 2 2 2 2 2 2 2 2 2 2 3 3

Each digit takes at most 20 iterations to process (up to 10 to update the cumulative counts and up to 10 to locate the median), resulting in an O(n) solution.

Answered By: Alain T.

If you keep a counter per digit (in a list) you actually have the sorted sequence implicitly represented. Then have a "pointer" to the current median: this pointer consists of two components: the median value itself (which is an index in the counter list), and which occurrence of that value really represents the median (an occurrence number).

When a new input digit is processed, you can then decide whether this pointer should be updated. It either stays where it is, or it moves one position forward or backward in the (implicit) sorted list.

Code:

def generatemedians(iterable):
    counter = [0] * 10  # a counter for each digit

    it = map(int, iterable)
    # Process the first entry
    median = next(it, None)
    if median is None:
        return  # No values
    medianidx = 0
    counter[median] = 1
    yield median

    # Process the other entries
    for i, digit in enumerate(it):
        counter[digit] += 1
        if i % 2 == 0:  # the total number of digits becomes even
            if digit < median:  # The median only changes if the digit is smaller
                if medianidx:
                    medianidx -= 1
                else:
                    median -= 1
                    while not counter[median]:
                        median -= 1
                    medianidx = counter[median] - 1
        else:  # the number of digits becomes odd
            if digit >= median:  # The median doesn't change if the digit is smaller
                if medianidx < counter[median] - 1:
                    medianidx += 1
                else:
                    median += 1
                    while not counter[median]:
                        median += 1
                    medianidx = 0
        yield median

# main 
num = input()
print(*generatemedians(num))

One iteration of the outer loop takes constant time, even when median += 1 or median -= 1 has to execute multiple times, as the range of median is 0..9.

Answered By: trincot