Getting first n unique elements from Python list

Question:

I have a python list where elements can repeat.

>>> a = [1,2,2,3,3,4,5,6]

I want to get the first n unique elements from the list.
So, in this case, if i want the first 5 unique elements, they would be:

[1,2,3,4,5]

I have come up with a solution using generators:

def iterate(itr, upper=5):

    count = 0
    for index, element in enumerate(itr):
        if index==0:
            count += 1
            yield element

        elif element not in itr[:index] and count<upper:
            count += 1
            yield element

In use:

>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]

I have doubts on this being the most optimal solution. Is there an alternative strategy that i can implement to write it in a more pythonic and efficient
way?

Asked By: xssChauhan

||

Answers:

I would use a set to remember what was seen and return from the generator when you have seen enough:

a = [1, 2, 2, 3, 3, 4, 5, 6]
    
def get_unique_N(iterable, N):
    """Yields (in order) the first N unique elements of iterable. 
    Might yield less if data too short."""
    seen = set()
    for e in iterable:
        if e in seen:
            continue
        seen.add(e)
        yield e
        if len(seen) == N:
            return
            
k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
print(list(k))
    

Output:

[1, 2, 3, 4]

According to PEP-479 you should return from generators, not raise StopIteration – thanks to @khelwood & @iBug for that piece of comment – one never learns out.

With 3.6 you get a deprecated warning, with 3.7 it gives RuntimeErrors: Transition Plan if still using raise StopIteration


Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups – with k being the length of the slice – using a set reduces this to O(1) lookups but uses more memory because the set has to be kept as well. It is a speed vs. memory tradeoff – what is better is application/data dependend.

Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:

For 6 uniques (in longer list):

  • you would have lookups of O(1)+O(2)+...+O(5001)
  • mine would have 5001*O(1) lookup + memory for set( {1, 2, 3, 4, 5, 6})
Answered By: Patrick Artner

You can adapt the popular itertools unique_everseen recipe:

def unique_everseen_limit(iterable, limit=5):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
        if len(seen) == limit:
            break

a = [1,2,2,3,3,4,5,6]

res = list(unique_everseen_limit(a))  # [1, 2, 3, 4, 5]

Alternatively, as suggested by @Chris_Rands, you can use itertools.islice to extract a fixed number of values from a non-limited generator:

from itertools import islice

def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]

Note the unique_everseen recipe is available in 3rd party libraries via more_itertools.unique_everseen or toolz.unique, so you could use:

from itertools import islice
from more_itertools import unique_everseen
from toolz import unique

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]
res = list(islice(unique(a), 5))           # [1, 2, 3, 4, 5]
Answered By: jpp

Using set with sorted+ key

sorted(set(a), key=list(a).index)[:5]
Out[136]: [1, 2, 3, 4, 5]
Answered By: BENY

There are really amazing answers for this question, which are fast, compact and brilliant! The reason I am putting here this code is that I believe there are plenty of cases when you don’t care about 1 microsecond time loose nor you want additional libraries in your code for one-time solving a simple task.

a = [1,2,2,3,3,4,5,6]
res = []
for x in a:
    if x not in res:  # yes, not optimal, but doesnt need additional dict
        res.append(x)
        if len(res) == 5:
            break
print(res)
Answered By: grapes

You can use OrderedDict or, since Python 3.7, an ordinary dict, since they are implemented to preserve the insertion order. Note that this won’t work with sets.

N = 3
a = [1, 2, 2, 3, 3, 3, 4]
d = {x: True for x in a}
list(d.keys())[:N]
Answered By: Jindra Helcl

If your objects are hashable (ints are hashable) you can write utility function using fromkeys method of collections.OrderedDict class (or starting from Python3.7 a plain dict, since they became officially ordered) like

from collections import OrderedDict


def nub(iterable):
    """Returns unique elements preserving order."""
    return OrderedDict.fromkeys(iterable).keys()

and then implementation of iterate can be simplified to

from itertools import islice


def iterate(itr, upper=5):
    return islice(nub(itr), upper)

or if you want always a list as an output

def iterate(itr, upper=5):
    return list(nub(itr))[:upper]

Improvements

As @Chris_Rands mentioned this solution walks through entire collection and we can improve this by writing nub utility in a form of generator like others already did:

def nub(iterable):
    seen = set()
    add_seen = seen.add
    for element in iterable:
        if element in seen:
            continue
        yield element
        add_seen(element)
Answered By: Azat Ibrakov

Here is a Pythonic approach using itertools.takewhile():

In [95]: from itertools import takewhile

In [96]: seen = set()

In [97]: set(takewhile(lambda x: seen.add(x) or len(seen) <= 4, a))
Out[97]: {1, 2, 3, 4}
Answered By: Mazdak

Assuming the elements are ordered as shown, this is an opportunity to have fun with the groupby function in itertools:

from itertools import groupby, islice

def first_unique(data, upper):
    return islice((key for (key, _) in groupby(data)), 0, upper)

a = [1, 2, 2, 3, 3, 4, 5, 6]

print(list(first_unique(a, 5)))

Updated to use islice instead of enumerate per @juanpa.arrivillaga. You don’t even need a set to keep track of duplicates.

Answered By: cdlane

Given

import itertools as it


a = [1, 2, 2, 3, 3, 4, 5, 6]

Code

A simple list comprehension (similar to @cdlane’s answer).

[k for k, _ in it.groupby(a)][:5]
# [1, 2, 3, 4, 5]

Alternatively, in Python 3.6+:

list(dict.fromkeys(a))[:5]
# [1, 2, 3, 4, 5]
Answered By: pylang

Why not use something like this?

>>> a = [1, 2, 2, 3, 3, 4, 5, 6]
>>> list(set(a))[:5]
[1, 2, 3, 4, 5]
Answered By: Aleksandr

Example list:

a = [1, 2, 2, 3, 3, 4, 5, 6]

Function returns all or count of unique items needed from list

1st argument – list to work with, 2nd argument (optional) – count of unique items (by default – None – it means that all unique elements will be returned)

def unique_elements(lst, number_of_elements=None):
    return list(dict.fromkeys(lst))[:number_of_elements]

Here is example how it works. List name is “a”, and we need to get 2 unique elements:

print(unique_elements(a, 2))

Output:

output

Answered By: Quanti Monati
a = [1,2,2,3,3,4,5,6]

from collections import defaultdict
def function(lis,n):
    dic = defaultdict(int)

    sol=set()

    for i in lis:
            try:
                if dic[i]:
                    pass
                else:
                    sol.add(i)
                    dic[i]=1
                    if len(sol)>=n:
                        break
            except KeyError:
                pass

    return list(sol)

print(function(a,3))

output

[1, 2, 3]
Answered By: sahasrara62

Profiling Analysis

Solutions

Which solution is the fastest? There are two clear favorite answers (and 3 solutions) that captured most of the votes.

  1. The solution by Patrick Artner – denoted as PA.
  2. The first solution by jpp – denoted as jpp1
  3. The second solution by jpp – denoted as jpp2

This is because these claim to run in O(N) while others here run in O(N^2), or do not guarantee the order of the returned list.

Experiment setup

For this experiment 3 variables were considered.

  1. N elements. The number of first N elements the function is searching for.
  2. List length. The longer the list the further the algorithm has to look to find the last element.
  3. Repeat limit. How many times an element can repeat before the next element occurs in the list. This is uniformly distributed between 1 and the repeat limit.

The assumptions for data generation were as follows. How strict these are depend on the algorithm used, but is more a note on how the data was generated than a limitation on the algorithms themselves.

  1. The elements never occur again after its repeated sequence first appears in the list.
  2. The elements are numeric and increasing.
  3. The elements are of type int.

So in a list of [1,1,1,2,2,3,4 ….] 1,2,3 would never appear again. The next element after 4 would be 5, but there could be a random number of 4s up to the repeat limit before we see 5.

A new dataset was created for each combination of variables and and re-generated 20 times. The python timeit function was used to profile the algorithms 50 times on each dataset. The mean time of the 20×50=1000 runs (for each combination) were reported here. Since the algorithms are generators, their outputs were converted to a list to get the execution time.

Results

As is expected the more elements searched for, the longer it takes. This graph shows that the execution time is indeed O(N) as claimed by the authors (the straight line proves this).

Fig 1. Varying the first N elements searched for.

Fig 1. Varying the first N elements searched for.

All three solutions do not consume additional computation time beyond that which is required. The below image shows what happens when the list is limited in size, and not N elements. Lists of length 10k, with elements repeating a maximum of 100 times (and thus on average repeating 50 times) would on average run out of unique elements by 200 (10000/50). If any of these graphs showed an increase in computation time beyond 200 this would be a cause for concern.

Fig 2. The effect of first N elements chosen > number of unique elements.

Fig 2. The effect of first N elements chosen > number of unique elements.

The figure below again shows that processing time increases (at a rate of O(N)) the more data the algorithm has to sift through. The rate of increase is the same as when first N elements were varied. This is because stepping through the list is the common execution block in both, and the execution block that ultimately decides how fast the algorithm is.

Fig 3. Varying the repeat limit.

Fig 3. Varying the repeat limit.

Conclusion

The 2nd solution posted by jpp is the fastest solution of the 3 in all cases. The solution is only slightly faster than the solution posted by Patrick Artner, and is almost twice as fast as his first solution.

Answered By: Jurgen Strydom