What is the fastest way to add data to a list without duplication in python (2.5)

Question:

I have about half a million items that need to be placed in a list, I can’t have duplications, and if an item is already there I need to get it’s index. So far I have

if Item in List:
    ItemNumber=List.index(Item)
else:
    List.append(Item)
    ItemNumber=List.index(Item)

The problem is that as the list grows it gets progressively slower until at some point it just isn’t worth doing. I am limited to python 2.5 because it is an embedded system.

Asked By: Dennis

||

Answers:

You can use a set (in CPython since version 2.4) to efficiently look up duplicate values. If you really need an indexed system as well, you can use both a set and list.

Doing your lookups using a set will remove the overhead of if Item in List, but not that of List.index(Item)

Please note ItemNumber=List.index(Item) will be very inefficient to do after List.append(Item). You know the length of the list, so your index can be retrieved with ItemNumber = len(List)-1.

To completely remove the overhead of List.index (because that method will search through the list – very inefficient on larger sets), you can use a dict mapping Items back to their index.

I might rewrite it as follows:

# earlier in the program, NOT inside the loop
Dup = {}

# inside your loop to add items:
if Item in Dup:
    ItemNumber = Dup[Item]
else:
    List.append(Item)
    Dup[Item] = ItemNumber = len(List)-1
Answered By: lunixbochs

If you really need to keep the data in an array, I’d use a separate dictionary to keep track of duplicates. This requires twice as much memory, but won’t slow down significantly.

existing = dict()
if Item in existing:
    ItemNumber = existing[Item]
else:
    ItemNumber = existing[Item] = len(List)
    List.append(Item)

However, if you don’t need to save the order of items you should just use a set instead. This will take almost as little space as a list, yet will be as fast as a dictionary.

Items = set()
# ...
Items.add(Item) # will do nothing if Item is already added

Both of these require that your object is hashable. In Python, most types are hashable unless they are a container whose contents can be modified. For example: lists are not hashable because you can modify their contents, but tuples are hashable because you cannot.

If you were trying to store values that aren’t hashable, there isn’t a fast general solution.

Answered By: Jeremy

You can improve the check a lot:

check = set(List)

for Item in NewList:
    if Item in check: ItemNumber = List.index(Item)
    else:
        ItemNumber = len(List)
        List.append(Item)

Or, even better, if order is not important you can do this:

oldlist = set(List)
addlist = set(AddList)
newlist = list(oldlist | addlist)

And if you need to loop over the items that were duplicated:

for item in (oldlist & addlist):
    pass # do stuff
Answered By: orlp
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.