Fastest union of multiple sorted lists, removing duplicates, with an ordered result
Question:
Given a list of lists in which all the sublists are sorted, e.g.:
[[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
what is the fastest way to get the union of these lists, without duplicates, as an ordered result?
So the resulting list should be:
[1,2,3,4,5,6,7,20,25,26,31]
I know I could just union them all without duplicates and then sort, but are there faster ways built into Python (like doing the sorting while doing the union)?
EDIT:
Is the proposed answer faster than executing the following algorithm pairwise with all the sublists?
UNION(x, y)
  answer <- ()
  while x != NIL and y != NIL
      do if docID(x) = docID(y)
            then ADD(answer, docID(x))
                 x <- next(x)
                 y <- next(y)
         else if docID(x) < docID(y)
            then ADD(answer, docID(x))
                 x <- next(x)
         else ADD(answer, docID(y))
              y <- next(y)
  return(answer)
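Translated into Python (a sketch: plain lists with indices stand in for the pseudocode's linked lists; note that the pseudocode above never copies the tail of the list that is not yet exhausted, which the version below does):

```python
def union(x, y):
    # Two-pointer union of two sorted lists, mirroring the pseudocode above.
    answer = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            answer.append(x[i])
            i += 1
            j += 1
        elif x[i] < y[j]:
            answer.append(x[i])
            i += 1
        else:
            answer.append(y[j])
            j += 1
    # At most one of the two lists still has items left over.
    answer.extend(x[i:])
    answer.extend(y[j:])
    return answer

print(union([1, 3, 7, 20, 31], [1, 2, 5, 6, 7]))
# [1, 2, 3, 5, 6, 7, 20, 31]
```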
Answers:
You can use heapq.merge
for this:
def mymerge(v):
    from heapq import merge
    last = None
    for a in merge(*v):
        if a != last:  # remove duplicates
            last = a
            yield a

print(list(mymerge([[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]])))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
You can make use of sets in Python 3.
mylist = [[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
mynewlist = mylist[0] + mylist[1] + mylist[2]
print(sorted(set(mynewlist)))
Output:
[1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
First merge all the sub-lists using list addition.
Then convert the result into a set, which deletes all the duplicates. Note that a set is unordered, so sorted() is needed to get the ascending order; it also already returns a list.
This gives your desired output.
Hope it answers your question.
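If the number of sublists is not fixed, the explicit list additions can be replaced by itertools.chain.from_iterable() (same approach, just generalized):

```python
import itertools

mylist = [[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]

# Flatten all sublists, deduplicate via a set, then sort into a list.
result = sorted(set(itertools.chain.from_iterable(mylist)))
print(result)
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
```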
(EDITED)
The asymptotically best approach to the problem is to use a priority queue, like, for example, the one implemented in heapq.merge()
(thanks to @kaya3 for pointing this out).
However, in practice, a number of things can go wrong. For example, the constant factors hidden by the complexity analysis can be large enough that a theoretically-optimal approach is, in real-life scenarios, slower than the naive one.
This is fundamentally dependent on the implementation.
For example, Python suffers a speed penalty for explicit looping.
So, let’s consider a couple of approaches and how they perform for some concrete inputs.
Approaches
Just to give you some idea of the numbers we are discussing, here are a few approaches:
merge_sorted()
uses the most naive approach: flatten the sequences, reduce them to a set()
(removing duplicates) and sort the result as required.
import itertools

def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))
merge_heapq()
is essentially @arshajii’s answer. Note that the itertools.groupby()
variation is slightly (less than ~1%) faster.
import heapq

def i_merge_heapq(seqs):
    last_item = None
    for item in heapq.merge(*seqs):
        if item != last_item:
            yield item
            last_item = item

def merge_heapq(seqs):
    return list(i_merge_heapq(seqs))
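The itertools.groupby() variation mentioned above is not shown in the original; it could look like this (groupby() collapses runs of equal items, which works here because merge() keeps duplicates adjacent):

```python
import heapq
import itertools

def merge_groupby(seqs):
    # groupby() yields one (key, group) pair per run of equal values;
    # keeping only the keys drops the duplicates.
    return [key for key, _group in itertools.groupby(heapq.merge(*seqs))]

print(merge_groupby([[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
```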
merge_bisect_set()
is substantially the same algorithm as merge_sorted()
except that the result is now constructed explicitly using the bisect
module for sorted insertions. Since sorted()
does fundamentally the same thing but without the explicit Python loop, this is not going to be faster.
import itertools
import bisect

def merge_bisect_set(seqs):
    result = []
    for item in set(itertools.chain.from_iterable(seqs)):
        bisect.insort(result, item)
    return result
merge_bisect_cond()
is similar to merge_bisect_set()
but now the non-repeating constraint is enforced explicitly by checking membership in the final list
. However, this is much more expensive than just using set()
(in fact it is so slow that it was excluded from the plots).
def merge_bisect_cond(seqs):
    result = []
    for item in itertools.chain.from_iterable(seqs):
        if item not in result:
            bisect.insort(result, item)
    return result
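A variation not in the original benchmarks (the name merge_bisect_seen() is mine): keep the bisect insertion, but track already-inserted items in a set, so the membership test is O(1) instead of a linear scan over the result list:

```python
import bisect
import itertools

def merge_bisect_seen(seqs):
    # A set makes the duplicate check O(1), while bisect.insort()
    # keeps the result list sorted as items arrive.
    result = []
    seen = set()
    for item in itertools.chain.from_iterable(seqs):
        if item not in seen:
            seen.add(item)
            bisect.insort(result, item)
    return result

print(merge_bisect_seen([[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
```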
merge_pairwise()
explicitly implements the theoretically efficient algorithm, similar to what you outlined in your question.
def join_sorted(seq1, seq2):
    result = []
    i = j = 0
    len1, len2 = len(seq1), len(seq2)
    while i < len1 and j < len2:
        if seq1[i] < seq2[j]:
            result.append(seq1[i])
            i += 1
        elif seq1[i] > seq2[j]:
            result.append(seq2[j])
            j += 1
        else:  # seq1[i] == seq2[j]
            result.append(seq1[i])
            i += 1
            j += 1
    if i < len1:
        result.extend(seq1[i:])
    elif j < len2:
        result.extend(seq2[j:])
    return result

def merge_pairwise(seqs):
    result = []
    for seq in seqs:
        result = join_sorted(result, seq)
    return result
merge_loop()
implements a generalization of the above, where a single pass is now made over all the sequences at once, instead of merging them pairwise.
def merge_loop(seqs):
    result = []
    lengths = list(map(len, seqs))
    idxs = [0] * len(seqs)
    while any(idx < length for idx, length in zip(idxs, lengths)):
        item = min(
            seq[idx]
            for idx, seq, length in zip(idxs, seqs, lengths) if idx < length)
        result.append(item)
        for i, (idx, seq, length) in enumerate(zip(idxs, seqs, lengths)):
            if idx < length and seq[idx] == item:
                idxs[i] += 1
    return result
Benchmarks
By generating the input using:
import random

def gen_input(n, m=100, a=None, b=None):
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(
        tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n))))
        for _ in range(m))
one can plot the timings for varying n:
Note that, in general, the performances will vary for different values of n (the size of each sequence) and m (the number of sequences), but also of a and b (the minimum and the maximum number generated).
For brevity, this was not explored in this answer, but feel free to play around with it here, which also includes some other implementations, notably some tentative speed-ups with Cython that were only partially successful.
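As a sketch of how such timings can be collected with the standard library (redefining the simplest approach inline so the snippet is self-contained; absolute numbers will of course vary by machine):

```python
import itertools
import random
import timeit

def gen_input(n, m=100, a=None, b=None):
    # Same generator as above: m sorted, duplicate-free tuples of up to n ints.
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(
        tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n))))
        for _ in range(m))

def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))

seqs = gen_input(100)
elapsed = timeit.timeit(lambda: merge_sorted(seqs), number=100)
print(f"merge_sorted: {elapsed:.4f}s for 100 runs")
```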