Fast way to get items removed from a list and items appended to a list based on an old version and a new version

Question:

I have a class in python that contains a list, that can be updated by calling a function. This list has several associated lists, which are all updated whenever the list is updated. However, one of the lists has roughly 40,000 items, and so when it is updated and all the related lists are updated, and it takes a very long time! Instead of remaking all the lists associated with the new list every time I update the list, I want to get what has been added to the main list and what has been removed, and only update these specific parts of the list.

Below is some code that gets what items have been removed, where self.buildings is the list that contains the old list, and globalvars[self.name] gets the current list.

def updatelist(self):

    globalvars = globals()
    removed = []
    for building in self.buildings:
      if building not in globalvars[self.name]:
        removed.append(building)

However, while this code functions, it takes a very long time for the list with 40,000 items, and I am certain that I could just iterate over the list and find which items have been removed without using in, which has a time complexity that makes this unusable.

Finally, I want some code which can also identify what has been appended to the list. Something will never be inserted into the list at a random place, so there will only by new items appearing at the end which should make it easier to code.

A sample input would be:

self.buildings = [[3, 2, 3, 5], [0, 0, 1, 1], [1, 1, 5, 5], [6, 2, 3, 3]]
globalvars[self.name] = [[0, 0, 1, 1], [1, 1, 5, 5], [8, 2, 6, 3], [6, 2, 7, 0]]

An ideal output would be:

removed = [[3, 2, 3, 5], [6, 2, 3, 3]]
appended = [[8, 2, 6, 3], [6, 2, 7, 0]]

How would I achieve this in an optimised way?

Thanks in advance.

Asked By: Yetiowner

||

Answers:

as i was discussing in the comments section, using sets will speed up your code, but your entries have to be hashable, if the input is in fact list of lists, then converting it to a set of tuples would do the trick, otherwise if it’s a complicated class then you must define a __hash__ function for it.

the below code shows the use of sets for the content in the question.

from random import randint

# construct the large random arrays
large_list_size = 10000
small_list_size = 4
first_list = []
second_list = []
first_list.append([1,1,1,1])
second_list.append([1,1,1,1])
for i in range(large_list_size):
    small_list_instance = [1,1,1,1]
    while small_list_instance in first_list:
        small_list_instance = []
        for j in range(small_list_size):
            small_list_instance.append(randint(0,10))
    first_list.append(small_list_instance)

    small_list_instance = [1, 1, 1, 1]
    while small_list_instance in second_list:
        small_list_instance = []
        for j in range(small_list_size):
            small_list_instance.append(randint(0, 10))
    second_list.append(small_list_instance)


# list implementation in question
def updatelist(first,second):

    removed = []
    for building in first:
      if building not in second:
        removed.append(building)
    return removed

# sets implemenation
def updatelist_sets(first,second):
    first_set = set([(*x,) for x in first])
    second_set = set([(*x,) for x in second])
    removed = first_set.difference(second_set)
    return removed

# measure lists time
import time
retries = 10
t1 = time.time()
for i in range(retries):
    a = updatelist(first_list,second_list)
t2 = time.time()
time_lists = (t2-t1)/retries
print('using lists = ',time_lists)

# measure sets time
t1 = time.time()
for i in range(retries):
    b = updatelist_sets(first_list,second_list)
t2 = time.time()
time_sets = (t2-t1)/retries
print('using sets = ',time_sets)
print('speedup = ',time_lists/time_sets)

# convert b to be like a in structure
b = [list(x) for x in b]

# compare equality of both outputs
if sorted(a) == sorted(b):
    print('both are equal')

you get

using lists =  1.0185745239257813
using sets =  0.004686903953552246
speedup =  217.32353255367963
both are equal

needless to say, most of the time is spent constructing the set of tuples, so if you change your original data structure to a set of hashable objects, then the code is going to be even faster.

Edit: fixed last comparison as @Kelly Bundy suggested.

Answered By: Ahmed AEK
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.