Find duplicate in array – Time complexity < O(n^2) and constant extra space O(1). (Amazon Interview)

Question:

Given below is the problem statement and the solution. I am not able to grasp the logic behind the solution.

Problem Statement:
Given an array nums containing n + 1 integers where each integer is between 1 and n (inclusive), prove that at least one duplicate number must exist. Assume that there is only one duplicate number, find the duplicate one.

Note:
You must not modify the array (assume the array is read only).
You must use only constant, O(1) extra space.
Your runtime complexity should be less than O(n^2).
There is only one duplicate number in the array, but it could be repeated more than once.

Sample Input: [3 4 1 4 1]
Output: 1

The Solution for the problem posted on leetcode is:

class Solution(object):
    def findDuplicate(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        low = 1
        high = len(nums)-1

        while low < high:
            mid = low+(high-low)//2  # floor division keeps mid an integer on Python 3
            count = 0
            for i in nums:
                if i <= mid:
                    count+=1
            if count <= mid:
                low = mid+1
            else:
                high = mid
        return low

Explanation for the above code (as per the author):
This solution is based on binary search.

At first the search space is the numbers from 1 to n. Each time I select a number mid (the one in the middle) and count all the numbers less than or equal to mid. If the count is more than mid, the search space becomes [1 mid]; otherwise it becomes [mid+1 n]. I repeat this until the search space is a single number.

Let’s say n=10 and I select mid=5. Then I count all the numbers in the array which are less than or equal to mid. If there are more than 5 numbers that are less than or equal to 5, then by the Pigeonhole Principle (https://en.wikipedia.org/wiki/Pigeonhole_principle) one of them has occurred more than once. So I shrink the search space from [1 10] to [1 5]. Otherwise the duplicate number is in the second half so for the next step the search space would be [6 10].
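
To make the explanation above concrete, here is a minimal sketch that records each step of the search. The function name find_duplicate_traced and the sample input [1, 3, 4, 2, 2] are illustrative assumptions, not part of the original post; the loop itself follows the posted solution, using // so that mid stays an integer on Python 3.

```python
def find_duplicate_traced(nums):
    """Binary search over the value range 1..n, recording every step."""
    low, high = 1, len(nums) - 1
    steps = []  # one (low, high, mid, count) tuple per pass of the while loop
    while low < high:
        mid = low + (high - low) // 2
        # Count the array elements whose VALUE is at most mid (full scan).
        count = sum(1 for x in nums if x <= mid)
        steps.append((low, high, mid, count))
        if count <= mid:
            low = mid + 1   # values 1..mid are not over-full: look above mid
        else:
            high = mid      # more than mid elements fit in 1..mid: look there
    return low, steps


# Hypothetical sample input: n = 4, values in 1..4, duplicate value 2.
duplicate, steps = find_duplicate_traced([1, 3, 4, 2, 2])
for low, high, mid, count in steps:
    print(f"range [{low}, {high}]  mid={mid}  count(<= mid)={count}")
print("duplicate:", duplicate)
```

The reason count > mid points to the lower half: if the duplicate were larger than mid, every value in 1..mid would appear at most once, so at most mid elements could be <= mid. Seeing more than mid of them therefore means the repeated value lies in 1..mid.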

Doubt: In the above solution, when count <= mid, why do we set low = mid + 1, and otherwise set high = mid? What’s the logic behind it?

I am unable to understand the logic behind this algorithm.

Related Link:
https://discuss.leetcode.com/topic/25580/two-solutions-with-explanation-o-nlog-n-and-o-n-time-o-1-space-without-changing-the-input-array

Asked By: kshikhar

Answers:

Let's say you have 10 numbers.

a=[1,2,2,3,4,5,6,7,8,9]

Here n=9, so the search space is [1-9] and mid=5.
The number of elements that are less than or equal to 5 is 6 (1,2,2,3,4,5).
Now count=6, which is greater than mid. This implies that there is at least one duplicate in the first half, so the code narrows the search space to the first half, that is, from [1-9] to [1-5], and so on.
Otherwise the duplicate occurs in the second half, so the search space would be [6-9].
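
The count in this example can be checked directly. A small sketch, where count_at_most is a hypothetical helper name used only for illustration:

```python
def count_at_most(values, limit):
    """Return how many elements of values are <= limit."""
    return sum(1 for v in values if v <= limit)


a = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]
mid = 5
count = count_at_most(a, mid)

# More than mid elements have a value in 1..mid, so one value must repeat there.
duplicate_in_lower_half = count > mid
print(count, duplicate_in_lower_half)
```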

Do tell me if you have doubts.

Answered By: Karan Nagpal

The logic behind setting low = mid+1 or high = mid is essentially what makes it a solution based on binary search. The search space is divided in half and the while loop is searching only in the lower half (high = mid) or the higher half (low = mid+1) on its next iteration.

So I shrink the search space from [1 10] to [1 5]. Otherwise the duplicate number is in the second half so for the next step the search space would be [6 10].

This is the part of the explanation regarding your question.

Answered By: Sven

Well it’s a binary search. You are cutting the search space in half and repeating.

Think about it this way: you have a list of 101 items, and you know it contains values 1-100. Take the halfway point, 50. Count how many items are less than or equal to 50. If there are more than 50 items that are less than or equal to 50, then the duplicate is in the range 1-50, otherwise the duplicate is in the range 51-100.

Binary search is simply cutting the range in half. Looking at 1-50, taking midpoint 25 and repeating.
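
As a rough check of the halving argument, the sketch below counts how many times the range is cut for a list of 101 items. The input (values 1 to 100 plus an extra 37) and the function name find_duplicate_counting_passes are made up for illustration.

```python
import math


def find_duplicate_counting_passes(nums):
    """Return (duplicate, number of passes of the while loop)."""
    low, high = 1, len(nums) - 1
    passes = 0
    while low < high:
        mid = low + (high - low) // 2
        count = sum(1 for x in nums if x <= mid)  # one full scan per pass
        passes += 1
        if count > mid:
            high = mid
        else:
            low = mid + 1
    return low, passes


# Hypothetical input: every value 1..100 once, plus one extra 37.
nums = list(range(1, 101)) + [37]
duplicate, passes = find_duplicate_counting_passes(nums)
print(duplicate, passes, math.ceil(math.log2(100)))
```

Each pass scans the whole list once, and the range is halved on every pass, so the loop runs about log2(n) times. That gives roughly n * log2(n) element checks in total, which is below O(n^2).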


The crucial part of this algorithm which I believe is causing confusion is the for loop. I’ll attempt to explain it. Firstly note that there is no usage of indices anywhere in this algorithm – just inspect the code and you’ll see that index references do not exist. Secondly, note that the algorithm loops through the entire collection for each iteration of the while loop.

Let me make the following change to the for loop, then consider the value of inspection_count after each pass of the while loop.

inspection_count=0
for i in nums:
    inspection_count+=1
    if i <= mid:
        count+=1

Of course inspection_count will be equal to len(nums). The for loop iterates the entire collection, and for every element checks to see whether it is within the candidate range (of values, not indices).

The duplication test itself is simple and elegant: as others pointed out, this is the pigeonhole principle. Given a collection of n values where every value is in the range {p..q}, the range contains only q-p+1 distinct values, so if q-p+1 < n then there must be duplicates in the collection. Think of some easy cases:

p = 0, q = 5, n = 10
"I have ten values, and every value is between zero and five.
At least one of these values must be duplicated."

We can generalize this, but a more relevant example is

p = 50, q = 99, n = 51
"I have a collection of fifty-one values, and every value is between fifty and ninety-nine.
There are only fifty *distinct* values available to my collection.
Therefore there is a duplicate."
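
The pigeonhole test can be written as a one-line check. This is only a sketch; the function name duplicate_is_guaranteed is an assumption introduced here, and it counts the distinct integers in an inclusive range p..q as q - p + 1.

```python
def duplicate_is_guaranteed(p, q, n):
    """True if any n integers taken from p..q (inclusive) must repeat a value."""
    distinct_available = q - p + 1  # number of different values in p..q
    return n > distinct_available


print(duplicate_is_guaranteed(0, 5, 10))   # ten values, six possible: True
print(duplicate_is_guaranteed(1, 4, 5))    # the problem's setup with n = 4: True
print(duplicate_is_guaranteed(1, 5, 5))    # five values, five possible: False
```

The original problem is the case p = 1, q = n with n + 1 elements, which is why at least one duplicate must exist.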
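
Answered By: Kirk Broadhurst

public class FindDuplicate {

    public static void main(String[] args) {
        findDuplicateInArrayTest();
    }

    public static void findDuplicateInArrayTest() {

        // n + 1 = 8 elements, every value in 1..7
        int[] arr = {1, 7, 7, 3, 6, 7, 2, 4};

        // l and r bound the range of VALUES (1..n), not array indices
        int dup = findDuplicateInArray(arr, 1, arr.length - 1);

        System.out.println("duplicate: " + dup);
    }

    public static int findDuplicateInArray(int[] arr, int l, int r) {

        while (l != r) {

            int m = l + (r - l) / 2;
            int count = 0;

            // Count the elements whose value is at most m
            for (int i = 0; i < arr.length; i++)
                if (arr[i] <= m)
                    count++;

            if (count > m)
                r = m;      // more than m elements in 1..m: duplicate is in l..m
            else
                l = m + 1;  // otherwise the duplicate is in m+1..r
        }
        return l;
    }
}

Answered By: Vibhor Rastogi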