How to quickly identify duplicate numbers in a list of dictionaries

Question:

I have a small question about searching for matching numbers.

What I have:

I have a list of dictionaries. Each dictionary holds two key-value pairs: one is a unique id and the other is a list of numbers.

What I need to do:

I need to determine whether any number also appears in the lists of the other dicts, and if so, write the parent id and that shared number into the processed_data dictionary. But my all list is very large, and this processing will take around 3.6 days.

So here is a question

I want to find another method that processes it faster, ideally in about 12 hours, and would be thankful for any help.

PS (Maybe some multiprocessing or threading would help somehow, but I have big doubts about whether that is possible.)

(Let’s ignore the case where the loop compares a dict’s list with itself and finds the same number.)

# List with dictionaries:
all = [
       {'id' : 0, 'numbers' : [1, 2, 3, 4, 5]},
       {'id' : 1, 'numbers' : [5, 7, 9]},
       {'id' : 2, 'numbers' : [10, 12, 14]},
       {'id' : 3, 'numbers' : [3, 12, 5]}
]

# Here I will store the results
processed_data = {}

# For every row
for every in all:

    id = every['id']
    numbers_list = every['numbers']

    for every_2 in all:
        numbers_list_2 = every_2['numbers']

        # If number exists in the other row
        # I remember id of the list which contains this number
        # And number
        for number in numbers_list_2:
            if number in numbers_list:
                processed_data[id] = number

print(processed_data)

Expected output:

{0: 5, 1: 5, 2: 12, 3: 5}
Asked By: user8531240


Answers:

You are indeed processing your lists very inefficiently. You combine every dictionary with every other dictionary (an O(N^2) process), and then compound this by comparing every value in a list of length K against another list of similar length with a sequential scan, so you end up with O(N^2 * K^2). That’s going to take a long, long time. It is a sequential scan because the containment test you use against your numbers (number in numbers_list) has to check every value in numbers_list one by one until a match is found or the list is exhausted.

You want to learn about sets. Sets let you test whether a number is present in constant time, because each value is itself the key to its location in the data structure, via its hash (so something in a_set just has to check whether hash(something) exists in the set’s hash table). You can also get their intersection very efficiently: the set of numbers that are present in both sets. For two sets of average size K that takes O(K) linear time. The intersection is simply setA & setB.
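
For illustration, here is a tiny, self-contained sketch of the two set operations relied on below (the values are made up):

a = {1, 2, 3, 4, 5}
b = {3, 12, 5}
print(4 in a)   # True  - constant-time membership test via the hash table
print(a & b)    # {3, 5} - intersection, roughly O(min(len(a), len(b)))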

I’d convert your data structure to contain sets to start with, then for the intersection between two sets record both ids per shared number, because both lists have those numbers in common. That then lets you reduce the O(N^2) pairing: once you have tested id A against id B, you don’t have to do the same thing for id B against id A. That’s technically still O(N^2), but the number of pairs is really a triangle number, the arithmetic series (N - 1) + (N - 2) + ... + 1, which is (N * (N - 1)) // 2. That’s a decidedly lower number of iterations, and for real-world problems it is much preferable over full-on N^2; halving the time taken still matters when comparing approaches on real-life inputs of finite size.

That looks like this:

all = [
    {'id' : 0, 'numbers' : [1, 2, 3, 4, 5]},
    {'id' : 1, 'numbers' : [5, 7, 9]},
    {'id' : 2, 'numbers' : [10, 12, 14]},
    {'id' : 3, 'numbers' : [3, 12, 5]},
]
# conversion to sets
for d in all:
    d['numbers'] = set(d['numbers'])

processed_data = {}

for i, entry in enumerate(all):
    id1 = entry['id']
    for j in range(i + 1, len(all)):
        id2 = all[j]['id']
        for num in entry['numbers'] & all[j]['numbers']:
            processed_data.update({id1: num, id2: num})

The nested loop can be further simplified by using itertools.combinations() to do the pairing for us:

from itertools import combinations

# everything before the loop the same up to setting `processed_data`

for entry1, entry2 in combinations(all, r=2):
    id1, id2 = entry1['id'], entry2['id']
    for num in entry1['numbers'] & entry2['numbers']:
        processed_data.update({id1: num, id2: num})

So this halves the number of combinations between the dictionaries you have, and per combination the work is done in linear instead of quadratic time. That alone could easily drop the processing time well under your 12-hour target. Replacing an O(K^2) algorithm with one that takes O(K) time (with similar fixed costs per step) lets you divide the time taken by roughly K. If your lists have a length of 1000 elements, we can expect a 1000-fold decrease in time taken; 3.6 days is about 311,000 seconds, so even with a very conservative 100-fold speed-up (not 1000) you can expect a runtime of roughly 3,100 seconds, a little under an hour. It sounds as if your lists are longer than that; imagine how much faster avoiding the O(K^2) comparison loops can prove to be!
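
As a quick sanity check on that arithmetic (the speed-up factor here is an assumption, not a measurement):

old_seconds = 3.6 * 24 * 60 * 60   # ~311,040 seconds for the O(N^2 * K^2) version
speedup = 100                      # deliberately conservative; K ~ 1000 suggests ~1000x
print(old_seconds / speedup)       # 3110.4, i.e. roughly 52 minutes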

Next, because your output uses id as a unique key, but your input in all may contain multiple entries for a given id, you can further reduce N and K here by first combining the numbers for each id into a single set. You can do so by mapping all to a dictionary with id as the key and sets as the values. Because we are still bound by an O(N^2) algorithm to combine the entries in all, reducing N has a big impact; if N = 1000, going to N = 999 removes 999 set intersections from consideration (replaced by a single set update to merge the duplicated id first).

Combining the inputs first when converting to sets then becomes:

from itertools import combinations

all = [
    {'id' : 0, 'numbers' : [1, 2, 3, 4, 5]},
    {'id' : 1, 'numbers' : [5, 7, 9]},
    {'id' : 2, 'numbers' : [10, 12, 14]},
    {'id' : 3, 'numbers' : [3, 12, 5]},
    # I've added another entry with id 1, but different numbers
    {'id' : 1, 'numbers' : [17, 42, 11]},
]
# conversion to sets
all_sets = {}
for d in all:
    all_sets.setdefault(d['id'], set()).update(d['numbers'])

# if memory is an issue at this point, consider adding 'del all'.

processed_data = {}

for (id1, numbers1), (id2, numbers2) in combinations(all_sets.items(), r=2):
    for num in numbers1 & numbers2:
        processed_data.update({id1: num, id2: num})

That double loop is actually simple enough to turn into a dict comprehension; there is no further advantage to this other than lowering the fixed cost of executing Python bytecode on each step:

processed_data = {
    id: num
    for (id1, num1), (id2, num2) in combinations(all_sets.items(), r=2)
    for num in num1 & num2
    for id in (id1, id2)
}

Note that we are only recording the last number that is shared with another list for a given id. If you want to record all such numbers, you need to make the processed_data values lists or sets, i.e. containers that can hold multiple numbers.

You can’t use a dict comprehension for that. Instead, use dict.setdefault() to ensure there is an empty container, and add to the container that it returns:

processed_data = {}

for (id1, numbers1), (id2, numbers2) in combinations(all_sets.items(), r=2):
    for num in numbers1 & numbers2:
        processed_data.setdefault(id1, set()).add(num)
        processed_data.setdefault(id2, set()).add(num)

This creates sets, but you can also use lists: processed_data.setdefault(id1, []).append(num), etc. For your sample all data, this produces:

>>> processed_data
{0: {3, 5}, 3: {3, 5, 12}, 1: {5}, 2: {12}}

The above approach requires numbers to contain hashable values, which integers are. If your actual setup doesn’t have hashable values, make them hashable by converting them for this job. The time savings are worth the effort.

E.g. if the values you must match are themselves lists of integers, convert them to tuples first. That is easily done:

all_sets = {}
for d in all:
    all_sets.setdefault(d['id'], set()).update(map(tuple, d['numbers']))

where map(tuple, d['numbers']) applies tuple() to each element before adding those tuples to the sets.

Answered By: Martijn Pieters

My first suggestion would be to pre-process the lists to convert them to sets; that way you can check for set intersections, which is much cheaper than iterating over all the lists again and again. If your numbers are small you could even use bitfields for even faster intersection checking (I guess there’s a balance to strike between the size of the lists and the range of the numbers / amount of collisions).

The second is that you don’t need to keep checking all the lists: for the first list you check every other list, but for the second list you don’t need to check the first one again; 1 & 2 and 2 & 1 are the same thing, so you already covered the intersection of 2 and 1 when you checked the intersection of 1 and 2. This means the k-th list (out of n) only needs to be checked against the n - k lists after it.
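
A rough sketch combining both ideas, assuming the numbers are small non-negative integers and the ids are unique (this is illustrative, not the answerer's code):

# Each list becomes one big int with bit n set for every number n it contains.
masks = {d['id']: sum(1 << n for n in set(d['numbers'])) for d in all}

processed_data = {}
ids = list(masks)
for i, id1 in enumerate(ids):
    for id2 in ids[i + 1:]:               # skip the mirrored (id2, id1) check
        common = masks[id1] & masks[id2]  # bitwise AND acts as set intersection
        while common:                     # extract each shared number
            num = (common & -common).bit_length() - 1  # position of lowest set bit
            processed_data[id1] = processed_data[id2] = num
            common &= common - 1          # clear the lowest set bit

print(processed_data)  # {0: 5, 1: 5, 3: 12, 2: 12} for the sample data (last match wins)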

Answered By: Masklinn

Another solution could be to use Pandas instead of lists and sets.

import pandas as pd

df = pd.DataFrame(all)

dfe = df.explode('numbers')

dfe
   id numbers
0   0       1
0   0       2
0   0       3
0   0       4
0   0       5
1   1       5
1   1       7
1   1       9
2   2      10
2   2      12
2   2      14
3   3       3
3   3      12
3   3       5

From there it’s easy to manipulate duplicates (e.g. see the drop_duplicates method) based on the id and numbers columns. You can also try combining the .groupby() and .duplicated() methods, as in the sketch below.
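
A hedged sketch of that idea, continuing from the dfe frame above (these exact steps are my assumption, not spelled out in the answer):

dfe = dfe.drop_duplicates()                          # one row per (id, number) pair
shared = dfe[dfe.duplicated('numbers', keep=False)]  # numbers appearing under 2+ ids
result = shared.groupby('id')['numbers'].apply(set).to_dict()
print(result)  # {0: {3, 5}, 1: {5}, 2: {12}, 3: {3, 5, 12}} (set order may vary)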

Answered By: tmo

I would suggest you first extract the data from Python and move it into a more durable storage solution (i.e. a database). Since you are talking about a very large volume of data, and considering the possibility of spending days churning on it, you might set yourself up for success if you spend some time up front pre-processing this data set.

Doing this will also make it easier to re-distribute the data to more than one compute node if you later decide you need such a solution. On top of that, if you are able to restructure this data to fit efficiently into database tables, you might find that you can re-cast the question as something that is easier to solve using SQL. That has the benefit of essentially letting someone else (again, the database) worry about the details of the implementation; all you need to do is describe the desired output.
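
To make that concrete, here is a minimal sketch using Python’s built-in sqlite3 module on the question’s sample all list; the table layout and query are assumptions for illustration, not part of this answer:

import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE numbers (id INTEGER, number INTEGER)')
con.executemany(
    'INSERT INTO numbers VALUES (?, ?)',
    [(d['id'], n) for d in all for n in d['numbers']],
)
# At real scale you would also add an index, e.g.:
# con.execute('CREATE INDEX idx_number ON numbers(number)')

# A number is "shared" when it occurs under more than one id.
rows = con.execute("""
    SELECT DISTINCT a.id, a.number
    FROM numbers a
    JOIN numbers b ON a.number = b.number AND a.id <> b.id
""").fetchall()
print(rows)  # [(0, 3), (0, 5), (1, 5), (2, 12), (3, 3), (3, 5), (3, 12)], order may vary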

Answered By: Z4-tier

Here is an approach where I first count the frequency of all the number pairs in the input list, then iterate over the same input list and check whether the current item has number pairs with a frequency > 1; if so, I append them to processed_data.

from collections import Counter
from operator import itemgetter
from itertools import chain

all = [
   {'id' : 0, 'numbers' : [[1,2], [2,9], [4,7]]},
   {'id' : 1, 'numbers' : [[8,2],  [2,9], [3,3]]},
   {'id' : 2, 'numbers' : [[6,6], [1,2], [8,8]]},
   {'id' : 3, 'numbers' : [[7,2], [8,1], [1,9]]}
]

# use tuple for pair numbers, needed for Counter, lists are not hashable
all = [{'id': e['id'], 'numbers': [tuple(l) for l in e['numbers']]} for e in all]

full_counter = Counter(chain(*map(itemgetter('numbers'), all)))

def get_nums_in_other(l):
    counter = Counter(l)
    return [e for e in counter if counter[e] < full_counter[e]]

processed_data = {i['id']: get_nums_in_other(i['numbers'])  for i in all}

# transform tuples in lists
processed_data = {k: [list(e) for e in v] for k, v in processed_data.items() if v}
processed_data

output:

{0: [[1, 2], [2, 9]], 1: [[2, 9]], 2: [[1, 2]]}

To avoid iterating once over all and a second time over processed_data (when the tuples are converted back into lists), you can use a plain for loop:

def get_nums_in_other(l):
    counter = Counter(l)
    return [list(e) for e in counter if counter[e] < full_counter[e]]

processed_data = {}

for item in all:
    nums_in_other = get_nums_in_other(item['numbers'])
    if nums_in_other:
        processed_data[item['id']] = nums_in_other
processed_data
Answered By: kederrac