Indexing a list with a unique index

Question:

I have a list, say l = [10,10,20,15,10,20]. I want to assign each unique value a certain “index” to get [1,1,2,3,1,2].

This is my code:

a = list(set(l))
res = [a.index(x) for x in l]

This turns out to be very slow.

l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?

Asked By: Yfiua

||

Answers:

You can do this in O(N) time using a defaultdict and a list comprehension:

>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]

In Python 3 use __next__ instead of next.
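
For reference, a Python 3 version of the snippet above might look like the following sketch. It only swaps next for __next__; the variable name res is added here for illustration.

```python
from collections import defaultdict
from itertools import count

lst = [10, 10, 20, 15, 10, 20]

# count(1).__next__ is a zero-argument callable that returns 1, 2, 3, ...
# defaultdict calls it only the first time a given key is looked up.
d = defaultdict(count(1).__next__)

res = [d[k] for k in lst]
print(res)      # [1, 1, 2, 3, 1, 2]
print(dict(d))  # {10: 1, 20: 2, 15: 3}
```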


If you’re wondering how it works:

The default_factory (count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key. The first 10 is a missing key, so its value becomes 1. The next 10 is no longer missing, so the previously computed 1 is reused. 20 is again a missing key, so Python calls the default_factory again to get its value, and so on.

d at the end will look like this:

>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
            {10: 1, 20: 2, 15: 3})
Answered By: Ashwini Chaudhary

The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.

The fastest way to transform one value to another is looking it up in a map. You’ll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.

Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.

res = []
conversion = {}
i = 0
for x in l:
    if x not in conversion:
        value = conversion[x] = i
        i += 1
    else:
        value = conversion[x]
    res.append(value)
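
As one possible way to tighten the loop above, the same 0-based mapping can be built inside a list comprehension with dict.setdefault. This is only a sketch: it removes the explicit append calls, but whether it is actually faster on your data is something to measure rather than assume.

```python
l = [10, 10, 20, 15, 10, 20]

conversion = {}
# For a value not seen before, setdefault stores len(conversion) (the next
# free index) and returns it; for a known value it returns the stored index.
res = [conversion.setdefault(x, len(conversion)) for x in l]

print(res)  # [0, 0, 1, 2, 0, 1]
```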
Answered By: dsh

Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.

To make it O(n), get rid of index() and store indexes in a dictionary:

>>> idx, indexes = 1, {}
>>> for x in l:
...     if x not in indexes:
...         indexes[x] = idx
...         idx += 1
... 
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]

If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.
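
A minimal sketch of that list-based variant follows. The bounds lo and hi are assumptions made for illustration; they are not part of the question, and the approach only pays off when the range is reasonably small.

```python
l = [10, 10, 20, 15, 10, 20]
lo, hi = 0, 100  # assumed bounds: every value x satisfies lo <= x <= hi

# One slot per possible value; 0 means "not seen yet".
indexes = [0] * (hi - lo + 1)
idx = 1
for x in l:
    if indexes[x - lo] == 0:
        indexes[x - lo] = idx
        idx += 1

res = [indexes[x - lo] for x in l]
print(res)  # [1, 1, 2, 3, 1, 2]
```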

Answered By: Eugene Yarmash

It depends on whether you want the indexes returned in that specific order. If you want the example to return:

    [1,1,2,3,1,2]

then look at the other answers. However, if you only care about getting a unique index for each unique number, here is a fast solution:

    import numpy as np
    l = [10,10,20,15,10,20]
    a = np.array(l)
    x,y = np.unique(a,return_inverse = True)

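For this example the output of y is shown below. The indexes are 0-based and follow the sorted order of the unique values, because np.unique returns them sorted: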

    y = [0,0,2,1,0,2]

I tested this for 1,000,000 entries and it was done essentially immediately.

Answered By: jfish003

For completeness, you can also do it eagerly:

from itertools import count

wordid = dict(zip(set(list_), count(1)))

This uses a set to obtain the unique words in list_, pairs
each of those unique words with the next value from count() (which
counts upwards), and constructs a dictionary from the results.
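
Note that a set has no guaranteed iteration order, so the ids produced this way need not follow the order of first appearance. If that order matters, one possible variation (a sketch, assuming Python 3.7+ where dicts keep insertion order) replaces set with dict.fromkeys. The sample data and the name res are added here for illustration.

```python
from itertools import count

list_ = [10, 10, 20, 15, 10, 20]

# dict.fromkeys keeps the first occurrence of each item, in order,
# so the ids follow the order in which items first appear.
wordid = dict(zip(dict.fromkeys(list_), count(1)))

res = [wordid[w] for w in list_]
print(wordid)  # {10: 1, 20: 2, 15: 3}
print(res)     # [1, 1, 2, 3, 1, 2]
```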

Original answer, written by nneonneo.

Answered By: GingerPlusPlus

You can use collections.OrderedDict to keep the unique items in their original order, then enumerate them (starting at 1) to build a dict that maps each item to its index. Passing the items of the main list to operator.itemgetter() and calling the result on that dict returns the corresponding index for each item:

>>> from collections import OrderedDict
>>> from operator import itemgetter
>>> itemgetter(*lst)({j:i for i,j in enumerate(OrderedDict.fromkeys(lst),1)})
(1, 1, 2, 3, 1, 2)
Answered By: Mazdak