Get difference between two lists with Unique Entries
Question:
I have two lists in Python:
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
Assuming the elements in each list are unique, I want to create a third list with items from the first list which are not in the second list:
temp3 = ['Three', 'Four']
Are there any fast ways without cycles and checking?
Answers:
Try this:
temp3 = set(temp1) - set(temp2)
To get elements which are in temp1 but not in temp2 (assuming uniqueness of the elements in each list):
In [5]: list(set(temp1) - set(temp2))
Out[5]: ['Four', 'Three']
Beware that it is asymmetric:
In [5]: set([1, 2]) - set([2, 3])
Out[5]: set([1])
where you might expect/want it to equal set([1, 3]). If you do want set([1, 3]) as your answer, you can use set([1, 2]).symmetric_difference(set([2, 3])).
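To make the asymmetry concrete, here is a minimal sketch comparing both directions of the set difference with the symmetric difference. The names `a`, `b`, `only_in_a`, `only_in_b` and `in_exactly_one` are illustrative and not from the answer above.

```python
a = [1, 2]
b = [2, 3]

# One-directional: elements of a that are not in b.
only_in_a = set(a) - set(b)
# Swapping the operands gives a different result.
only_in_b = set(b) - set(a)
# Symmetric difference: elements that appear in exactly one of the lists.
in_exactly_one = set(a) ^ set(b)

print(only_in_a, only_in_b, in_exactly_one)
```

The `^` operator and the `symmetric_difference()` method compute the same thing when both operands are sets.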
You could use list comprehension:
temp3 = [item for item in temp1 if item not in temp2]
I’ll toss this in, since none of the present solutions yields a tuple:
temp3 = tuple(set(temp1) - set(temp2))
alternatively:
# Edited using @Mark Byers' idea. If you accept this one as the answer, just accept his instead.
s = set(temp2)  # build the set once instead of on every iteration
temp3 = tuple(x for x in temp1 if x not in s)
Like the other answers in this direction that do not yield a tuple, it preserves order.
The existing solutions all offer either one or the other of:
- Faster than O(n*m) performance.
- Preserve order of input list.
But so far no solution has both. If you want both, try this:
s = set(temp2)
temp3 = [x for x in temp1 if x not in s]
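As a rough sketch, the two lines above can be wrapped in a small reusable function. The name `ordered_difference` is hypothetical; the approach is the same as in the answer (build the set once, then filter the first list).

```python
def ordered_difference(first, second):
    """Return the items of `first` that are not in `second`, keeping order."""
    exclude = set(second)  # built once, so each lookup is O(1) on average
    return [x for x in first if x not in exclude]


temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
temp3 = ordered_difference(temp1, temp2)
print(temp3)  # ['Three', 'Four']
```

Note that this keeps duplicates from the first list, and it requires the items of the second list to be hashable.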
Performance test
import timeit
init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
print(timeit.timeit('list(set(temp1) - set(temp2))', init, number = 100000))
print(timeit.timeit('s = set(temp2);[x for x in temp1 if x not in s]', init, number = 100000))
print(timeit.timeit('[item for item in temp1 if item not in temp2]', init, number = 100000))
Results:
4.34620224079 # ars' answer
4.2770634955 # This answer
30.7715615392 # matt b's answer
As well as preserving order, the method I presented is also (slightly) faster than the set subtraction because it doesn’t require construction of an unnecessary set. The performance difference would be more noticeable if the first list is considerably longer than the second and if hashing is expensive. Here’s a second test demonstrating this:
init = '''
temp1 = [str(i) for i in range(100000)]
temp2 = [str(i * 2) for i in range(50)]
'''
Results:
11.3836875916 # ars' answer
3.63890368748 # this answer (3 times faster!)
37.7445402279 # matt b's answer
This could be even faster than Mark’s list comprehension:
import itertools
list(itertools.filterfalse(set(temp2).__contains__, temp1))
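For readers unfamiliar with `filterfalse`, here is a small sketch of what the one-liner does, next to the equivalent list comprehension. The variable names `exclude`, `via_filterfalse` and `via_comprehension` are illustrative. No speed claim is made here; that would need measuring on your own data.

```python
from itertools import filterfalse

temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
exclude = set(temp2)

# filterfalse keeps the items for which the predicate returns False,
# i.e. the items of temp1 that are NOT contained in the set.
lazy_result = filterfalse(exclude.__contains__, temp1)
via_filterfalse = list(lazy_result)  # the iterator is lazy until consumed

via_comprehension = [x for x in temp1 if x not in exclude]
print(via_filterfalse)  # ['Three', 'Four']
```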
The difference between two lists (say list1 and list2) can be found using the following simple function.
def diff(list1, list2):
    c = set(list1).union(set(list2))  # or c = set(list1) | set(list2)
    d = set(list1).intersection(set(list2))  # or d = set(list1) & set(list2)
    return list(c - d)
or
def diff(list1, list2):
    return list(set(list1).symmetric_difference(set(list2)))  # or return list(set(list1) ^ set(list2))
By using the above function, the difference can be found using diff(temp2, temp1) or diff(temp1, temp2). Both will give the result ['Four', 'Three']. You don’t have to worry about the order of the list or which list is to be given first.
This is another solution:
def diff(a, b):
    xa = [i for i in set(a) if i not in b]
    xb = [i for i in set(b) if i not in a]
    return xa + xb
You could use a naive slicing method, but only if both lists are sorted, contain no duplicates, and the second list is a prefix of the first:
list1 = [1, 2, 3, 4, 5]
list2 = [1, 2, 3]
print(list1[len(list2):])
or with native set methods:
subset = set(list1).difference(list2)
print(subset)
import timeit
init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
# Note: with this setup temp2 is not a prefix of temp1, so the slice is timed but does not produce the true difference.
print("Naive solution: ", timeit.timeit('temp1[len(temp2):]', init, number = 100000))
print("Native set solution: ", timeit.timeit('set(temp1).difference(temp2)', init, number = 100000))
Naive solution: 0.0787101593292
Native set solution: 0.998837615564
Single-line version of arulmr’s solution:
def diff(listA, listB):
    return (set(listA) - set(listB)) | (set(listB) - set(listA))
If you run into TypeError: unhashable type: 'list' you need to turn lists or sets into tuples, e.g.
set(map(tuple, list_of_lists1)).symmetric_difference(set(map(tuple, list_of_lists2)))
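A short sketch of the tuple conversion, with made-up sample data (the contents of `list_of_lists1` and `list_of_lists2` are assumptions for illustration). It also converts the result back to lists, which the one-liner above does not do.

```python
list_of_lists1 = [[1, 2], [3, 4], [5, 6]]
list_of_lists2 = [[3, 4], [7, 8]]

# Lists are unhashable, so convert each inner list to a tuple first.
as_tuples1 = set(map(tuple, list_of_lists1))
as_tuples2 = set(map(tuple, list_of_lists2))

difference = as_tuples1.symmetric_difference(as_tuples2)

# Convert back to lists; sorted() is only there to make the
# display order deterministic, since sets are unordered.
result = sorted(list(t) for t in difference)
print(result)  # [[1, 2], [5, 6], [7, 8]]
```

This only works one level deep: inner lists that themselves contain lists would still be unhashable after `tuple()`.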
In case you want the difference recursively, I have written a package for python:
https://github.com/seperman/deepdiff
Installation
Install from PyPI:
pip install deepdiff
Example usage
Importing
>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function # In case running on Python 2
Same object returns empty
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> print(DeepDiff(t1, t2))
{}
Type of an item has changed
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{ 'type_changes': { 'root[2]': { 'newtype': <class 'str'>,
'newvalue': '2',
'oldtype': <class 'int'>,
'oldvalue': 2}}}
Value of an item has changed
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}
Item added and/or removed
>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff)
{'dic_item_added': ['root[5]', 'root[6]'],
'dic_item_removed': ['root[4]'],
'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}
String difference
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { 'root[2]': {'newvalue': 4, 'oldvalue': 2},
"root[4]['b']": { 'newvalue': 'world!',
'oldvalue': 'world'}}}
String difference 2
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { "root[4]['b']": { 'diff': '--- \n'
'+++ \n'
'@@ -1,5 +1,4 @@\n'
'-world!\n'
'-Goodbye!\n'
'+world\n'
' 1\n'
' 2\n'
' End',
'newvalue': 'world\n1\n2\nEnd',
'oldvalue': 'world!\n'
'Goodbye!\n'
'1\n'
'2\n'
'End'}}}
>>>
>>> print (ddiff['values_changed']["root[4]['b']"]["diff"])
---
+++
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
1
2
End
Type change
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'type_changes': { "root[4]['b']": { 'newtype': <class 'str'>,
'newvalue': 'world\n\n\nEnd',
'oldtype': <class 'list'>,
'oldvalue': [1, 2, 3]}}}
List difference
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3, 4]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{'iterable_item_removed': {"root[4]['b'][2]": 3, "root[4]['b'][3]": 4}}
List difference 2:
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': {"root[4]['b'][3]": 3},
'values_changed': { "root[4]['b'][1]": {'newvalue': 3, 'oldvalue': 2},
"root[4]['b'][2]": {'newvalue': 2, 'oldvalue': 3}}}
List difference ignoring order or duplicates: (with the same dictionaries as above)
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2, ignore_order=True)
>>> print (ddiff)
{}
List that contains dictionary:
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
'values_changed': {"root[4]['b'][2][1]": {'newvalue': 3, 'oldvalue': 1}}}
Sets:
>>> t1 = {1, 2, 8}
>>> t2 = {1, 2, 3, 5}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (DeepDiff(t1, t2))
{'set_item_added': ['root[3]', 'root[5]'], 'set_item_removed': ['root[8]']}
Named Tuples:
>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> pprint (DeepDiff(t1, t2))
{'values_changed': {'root.y': {'newvalue': 23, 'oldvalue': 22}}}
Custom objects:
>>> class ClassA(object):
... a = 1
... def __init__(self, b):
... self.b = b
...
>>> t1 = ClassA(1)
>>> t2 = ClassA(2)
>>>
>>> pprint(DeepDiff(t1, t2))
{'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}
Object attribute added:
>>> t2.c = "new attribute"
>>> pprint(DeepDiff(t1, t2))
{'attribute_added': ['root.c'],
'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}
If you are really looking into performance, then use numpy!
Here is the full notebook as a gist on github with comparison between list, numpy, and pandas.
https://gist.github.com/denfromufa/2821ff59b02e9482be15d27f2bbd4451
If you want something more like a changeset, you could use Counter:
from collections import Counter
def diff(a, b):
    """ more verbose than needs to be, for clarity """
    ca, cb = Counter(a), Counter(b)
    to_add = cb - ca
    to_remove = ca - cb
    changes = Counter(to_add)
    changes.subtract(to_remove)
    return changes
lista = ['one', 'three', 'four', 'four', 'one']
listb = ['one', 'two', 'three']
In [127]: diff(lista, listb)
Out[127]: Counter({'two': 1, 'one': -1, 'four': -2})
# in order to go from lista to listb, you need to add a "two", remove a "one", and remove two "four"s
In [128]: diff(listb, lista)
Out[128]: Counter({'four': 2, 'one': 1, 'two': -1})
# in order to go from listb to lista, you must add two "four"s, add a "one", and remove a "two"
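As a sanity check on the changeset idea, here is a sketch that applies the resulting Counter back to the first list's counts. The variable `patched` is an illustrative name; the `diff` function is a condensed restatement of the one above so that the snippet runs on its own.

```python
from collections import Counter


def diff(a, b):
    ca, cb = Counter(a), Counter(b)
    changes = cb - ca          # items to add (positive counts)
    changes.subtract(ca - cb)  # items to remove (negative counts)
    return changes


lista = ['one', 'three', 'four', 'four', 'one']
listb = ['one', 'two', 'three']

changes = diff(lista, listb)

# Counter addition sums the counts and drops entries that are not
# positive, so applying the changeset to lista's counts gives listb's.
patched = Counter(lista) + changes
print(patched == Counter(listb))  # True
```

This only tracks how many of each item there are; positions in the list are not recorded.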
I wanted something that would take two lists and could do what diff in bash does. Since this question pops up first when you search for “python diff two lists” and is not very specific, I will post what I came up with.
Using SequenceMatcher from difflib you can compare two lists like diff does. None of the other answers will tell you the position where the difference occurs, but this one does. Some answers give the difference in only one direction. Some reorder the elements. Some don’t handle duplicates. But this solution gives you a true difference between two lists:
a = 'A quick fox jumps the lazy dog'.split()
b = 'A quick brown mouse jumps over the dog'.split()
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, a, b).get_opcodes():
    if tag == 'equal': print('both have', a[i:j])
    if tag in ('delete', 'replace'): print(' 1st has', a[i:j])
    if tag in ('insert', 'replace'): print(' 2nd has', b[k:l])
This outputs:
both have ['A', 'quick']
1st has ['fox']
2nd has ['brown', 'mouse']
both have ['jumps']
2nd has ['over']
both have ['the']
1st has ['lazy']
both have ['dog']
Of course, if your application makes the same assumptions the other answers make, you will benefit from them the most. But if you are looking for true diff functionality, then this is the only way to go.
For example, none of the other answers could handle:
a = [1,2,3,4,5]
b = [5,4,3,2,1]
But this one does:
2nd has [5, 4, 3, 2]
both have [1]
1st has [2, 3, 4, 5]
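If you need the differences as data rather than printed lines, the same opcodes can be collected into lists. This is only a sketch; the helper name `list_diff_ops` and the `(index, items)` pair format are my own choices, not part of difflib.

```python
from difflib import SequenceMatcher


def list_diff_ops(a, b):
    """Return (only_in_a, only_in_b) as lists of (start_index, items) pairs."""
    only_in_a, only_in_b = [], []
    for tag, i, j, k, l in SequenceMatcher(None, a, b).get_opcodes():
        if tag in ('delete', 'replace'):
            only_in_a.append((i, a[i:j]))  # position in the first list
        if tag in ('insert', 'replace'):
            only_in_b.append((k, b[k:l]))  # position in the second list
    return only_in_a, only_in_b


a = 'A quick fox jumps the lazy dog'.split()
b = 'A quick brown mouse jumps over the dog'.split()
removed, added = list_diff_ops(a, b)
print(removed)  # [(2, ['fox']), (5, ['lazy'])]
print(added)    # [(2, ['brown', 'mouse']), (5, ['over'])]
```

The indices are the slice starts reported by `get_opcodes()`, so each one refers to a position in the list its items came from.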
This can be done using Python’s XOR operator (^).
- This will remove the duplicates in each list
- This will show difference of temp1 from temp2 and temp2 from temp1.
set(temp1) ^ set(temp2)
Here’s a Counter answer for the simplest case.
This is shorter than the one above that does two-way diffs because it only does exactly what the question asks: generate a list of what’s in the first list but not the second.
from collections import Counter
lst1 = ['One', 'Two', 'Three', 'Four']
lst2 = ['One', 'Two']
c1 = Counter(lst1)
c2 = Counter(lst2)
diff = list((c1 - c2).elements())
Alternatively, depending on your readability preferences, it makes for a decent one-liner:
diff = list((Counter(lst1) - Counter(lst2)).elements())
Output:
['Three', 'Four']
Note that you can remove the list(...) call if you are just iterating over it.
Because this solution uses counters, it handles quantities properly vs the many set-based answers. For example on this input:
lst1 = ['One', 'Two', 'Two', 'Two', 'Three', 'Three', 'Four']
lst2 = ['One', 'Two']
The output is:
['Two', 'Two', 'Three', 'Three', 'Four']
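To illustrate the contrast with set-based answers, here is a sketch running both on the same input. The names `counter_diff` and `set_diff` are illustrative.

```python
from collections import Counter

lst1 = ['One', 'Two', 'Two', 'Two', 'Three', 'Three', 'Four']
lst2 = ['One', 'Two']

# Multiset difference: each item in lst2 cancels ONE occurrence in lst1.
counter_diff = list((Counter(lst1) - Counter(lst2)).elements())

# Set difference: any occurrence in lst2 removes the item entirely,
# and duplicates in lst1 are collapsed.
set_diff = set(lst1) - set(lst2)

print(counter_diff)
print(set_diff)
```

Which behavior is right depends on whether repeated items carry meaning in your data.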
We can calculate the union minus the intersection of the lists:
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two', 'Five']
set(temp1+temp2)-(set(temp1)&set(temp2))
Out: set(['Four', 'Five', 'Three'])
This can be solved with one line.
The question is given two lists (temp1 and temp2) return their difference in a third list (temp3).
temp3 = list(set(temp1).difference(set(temp2)))
I am a little late to the game, but you can compare the performance of some of the code mentioned above using the script below. Two of the fastest contenders are:
list(set(x).symmetric_difference(set(y)))
list(set(x) ^ set(y))
I apologize for the elementary level of coding.
import time
import random
from itertools import filterfalse
# 1 - performance (time taken)
# 2 - correctness (answer - 1,4,5,6)
# set performance
performance = 1
numberoftests = 7
def answer(x, y, z):
    # time.clock() was removed in Python 3.8; time.perf_counter() is used instead.
    if z == 0:
        start = time.perf_counter()
        # fixed: the second term was set(y) - set(y), which is always empty
        lists = str(list(set(x) - set(y)) + list(set(y) - set(x)))
        times = "1 = " + str(time.perf_counter() - start)
        return (lists, times)
    elif z == 1:
        start = time.perf_counter()
        lists = str(list(set(x).symmetric_difference(set(y))))
        times = "2 = " + str(time.perf_counter() - start)
        return (lists, times)
    elif z == 2:
        start = time.perf_counter()
        lists = str(list(set(x) ^ set(y)))
        times = "3 = " + str(time.perf_counter() - start)
        return (lists, times)
    elif z == 3:
        start = time.perf_counter()
        # filterfalse is lazy, so wrap it in list() to time the actual work
        lists = list(filterfalse(set(y).__contains__, x))
        times = "4 = " + str(time.perf_counter() - start)
        return (lists, times)
    elif z == 4:
        start = time.perf_counter()
        lists = tuple(set(x) - set(y))
        times = "5 = " + str(time.perf_counter() - start)
        return (lists, times)
    elif z == 5:
        start = time.perf_counter()
        lists = [tt for tt in x if tt not in y]
        times = "6 = " + str(time.perf_counter() - start)
        return (lists, times)
    else:
        start = time.perf_counter()
        Xarray = [iDa for iDa in x if iDa not in y]
        Yarray = [iDb for iDb in y if iDb not in x]
        lists = str(Xarray + Yarray)
        times = "7 = " + str(time.perf_counter() - start)
        return (lists, times)
n = numberoftests
if performance == 2:
    a = [1,2,3,4,5]
    b = [3,2,6]
    for c in range(0,n):
        d = answer(a,b,c)
        print(d[0])
elif performance == 1:
    for tests in range(0,10):
        print("Test Number" + str(tests + 1))
        a = random.sample(range(1, 900000), 9999)
        b = random.sample(range(1, 900000), 9999)
        for c in range(0,n):
            #if c not in (1,4,5,6):
            d = answer(a,b,c)
            print(d[1])
The simplest way is to use set().difference(set()):
list_a = [1, 2, 3]
list_b = [2, 3]
print(set(list_a).difference(set(list_b)))
The answer is {1} (displayed as set([1]) on Python 2).
To print it as a list:
print(list(set(list_a).difference(set(list_b))))
Here are a few simple, order-preserving ways of diffing two lists of strings.
Code
An unusual approach using pathlib:
import pathlib
temp1 = ["One", "Two", "Three", "Four"]
temp2 = ["One", "Two"]
p = pathlib.Path(*temp1)
r = p.relative_to(*temp2)
list(r.parts)
# ['Three', 'Four']
This assumes both lists contain strings with equivalent beginnings. See the docs for more details. Note, it is not particularly fast compared to set operations.
A straightforward implementation using itertools.zip_longest:
import itertools as it
[x for x, y in it.zip_longest(temp1, temp2) if x != y]
# ['Three', 'Four']
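One caveat worth spelling out: the zip_longest version compares items position by position, so it appears to rely on the second list being a prefix of the first. The sketch below (the helper name `positional_diff` and the reordered input are illustrative assumptions) shows how it diverges from a membership-based difference when the order changes.

```python
import itertools as it


def positional_diff(first, second):
    # Keeps the items of `first` whose counterpart at the same index differs.
    return [x for x, y in it.zip_longest(first, second) if x != y]


temp1 = ["One", "Two", "Three", "Four"]

prefix_case = positional_diff(temp1, ["One", "Two"])
reordered_case = positional_diff(temp1, ["Two", "One"])
membership_case = [x for x in temp1 if x not in {"Two", "One"}]

print(prefix_case)      # ['Three', 'Four']
print(reordered_case)   # ['One', 'Two', 'Three', 'Four']
print(membership_case)  # ['Three', 'Four']
```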
Here is a simple way to find the differences between two lists (whatever the contents are), using the built-in set type (the sets module used by older code was removed in Python 3). You can get the result as shown below; the order of elements in the printed set may vary:
>>> l1 = ['xvda', False, 'xvdbb', 12, 'xvdbc']
>>> l2 = ['xvda', 'xvdbb', 'xvdbc', 'xvdbd', None]
>>>
>>> set(l1).symmetric_difference(set(l2))
{False, 'xvdbd', None, 12}
Hope this helps.
Let’s say we have two lists
list1 = [1, 3, 5, 7, 9]
list2 = [1, 2, 3, 4, 5]
we can see from the above two lists that items 1, 3, 5 exist in list2 and items 7, 9 do not. On the other hand, items 1, 3, 5 exist in list1 and items 2, 4 do not.
What is the best solution to return a new list containing items 7, 9 and 2, 4?
All the answers above find the solution; now, which is the most efficient?
def difference(list1, list2):
    new_list = []
    for i in list1:
        if i not in list2:
            new_list.append(i)
    for j in list2:
        if j not in list1:
            new_list.append(j)
    return new_list
versus
def sym_diff(list1, list2):
    return list(set(list1).symmetric_difference(set(list2)))
Using timeit we can see the results
import timeit
t1 = timeit.Timer("difference(list1, list2)", "from __main__ import difference, list1, list2")
t2 = timeit.Timer("sym_diff(list1, list2)", "from __main__ import sym_diff, list1, list2")
print('Using two for loops', t1.timeit(number=100000), 'seconds')
print('Using symmetric_difference', t2.timeit(number=100000), 'seconds')
returns
[7, 9, 2, 4]
Using two for loops 0.11572412995155901 seconds
Using symmetric_difference 0.11285737506113946 seconds
def diffList(list1, list2):  # returns the difference between two lists.
    if len(list1) > len(list2):
        return (list(set(list1) - set(list2)))
    else:
        return (list(set(list2) - set(list1)))
e.g. if list1 = [10, 15, 20, 25, 30, 35, 40] and list2 = [25, 40, 35] then the returned list will be output = [10, 20, 30, 15]
I prefer converting to sets and then using the “difference()” function. The full code is:
temp1 = ['One', 'Two', 'Three', 'Four' ]
temp2 = ['One', 'Two']
set1 = set(temp1)
set2 = set(temp2)
set3 = set1.difference(set2)
temp3 = list(set3)
print(temp3)
Output:
>>>print(temp3)
['Three', 'Four']
It’s the easiest to understand, and moreover, if you work with large data in the future, converting it to sets will remove duplicates if duplicates are not required. Hope it helps 😉
You can cycle through the first list and add every item that isn’t in the second list to the third list. E.g.:
temp3 = []
for i in temp1:
    if i not in temp2:
        temp3.append(i)
print(temp3)
I know this question got great answers already, but I wish to add the following method using numpy.
import numpy as np
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
list(np.setdiff1d(temp1, temp2))
['Four', 'Three'] #Output
If you need to remove from list a all values that are present in list b:
def list_diff(a, b):
    r = []
    for i in a:
        if i not in b:
            r.append(i)
    return r
list_diff([1,2,2], [1])
Result: [2,2]
or
def list_diff(a, b):
    return [x for x in a if x not in b]
Here is a modified version of @SuperNova’s answer
def get_diff(a: list, b: list) -> list:
    return list(set(a) ^ set(b))
Following on @arkolec’s answer, here is a utility class for comparing lists, tuples and sets:
from difflib import SequenceMatcher
class ListDiffer:
    def __init__(self, left, right, strict=False):
        assert isinstance(left, (list, tuple, set)), "left must be list, tuple or set"
        assert isinstance(right, (list, tuple, set)), "right must be list, tuple or set"
        self.l = list(left) if isinstance(left, (tuple, set)) else left
        self.r = list(right) if isinstance(right, (tuple, set)) else right
        if strict:
            assert isinstance(left, right.__class__), \
                f'left type ({left.__class__.__name__}) must equal right type ({right.__class__.__name__})'
        self.diffs = []
        self.equal = []
        for tag, i, j, k, l in SequenceMatcher(None, self.l, self.r).get_opcodes():
            if tag in ['delete', 'replace', 'insert']:
                self.diffs.append((tag, i, j, k, l))
            elif tag == 'equal':
                # slice self.l rather than left, since left may be a set
                self.equal.extend(self.l[i:j])
    def has_diffs(self):
        return len(self.diffs) > 0
    def only_left(self):
        a = self.l[:]
        for v in self.equal:
            a.remove(v)
        return a
    def only_right(self):
        a = self.r[:]
        for v in self.equal:
            a.remove(v)
        return a
    def __str__(self, verbose=False):
        iD = 0
        sb = []
        if verbose:
            sb.append(f"left: {self.l}\n")
            sb.append(f"right: {self.r}\n")
            sb.append("diffs: ")
        for tag, i, j, k, l in self.diffs:
            s = f"({iD})"
            if iD > 0: sb.append(' | ')
            if tag in ('delete', 'replace'): s = f'{s} l:{self.l[i:j]}'
            if tag in ('insert', 'replace'): s = f'{s} r:{self.r[k:l]}'
            sb.append(s)
            iD = iD + 1
        if verbose:
            sb.append(f"\nequal: {self.equal}")
        return ''.join(sb)
    def __repr__(self) -> str:
        return "<ListDiffer> {}".format(self.__str__())
Usage:
left = ['a','b','c']
right = ['aa','b','c','d']
# right = ('aa','b','c','d')
ld = ListDiffer(left, right, strict=True)
print(f'ld.has_diffs(): {ld.has_diffs()}')
print(f'ld: {ld}')
print(f'ld.only_left(): {ld.only_left()}')
print(f'ld.only_right(): {ld.only_right()}')
Output:
ld.has_diffs(): True
ld: (0) l:['a'] r:['aa'] | (1) r:['d']
ld.only_left(): ['a']
ld.only_right(): ['aa', 'd']
I cannot speak to performance, but you could use ld.only_left() to get the result you are looking for.
If the lists are of objects and not primitive types, this is one way of doing it.
The code is more explicit and gives out a copy.
This may not be an efficient implementation, but clean for smaller lists of objects.
a = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 1, 'id2': 'C'},  # out
    {'id1': 2, 'id2': 'A'},
    {'id1': 2, 'id2': 'B'},  # out
]
b = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 2, 'id2': 'A'},
]
def difference(a, b):
    for x in a:
        for y in b:
            if x['id1'] == y['id1'] and x['id2'] == y['id2']:
                x['is_removed'] = True
    c = [x for x in a if not x.get('is_removed', False)]
    return c
print(difference(a, b))
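Since dictionaries are unhashable, one alternative is to build a set of key tuples from the second list and filter the first list against it. This sketch assumes, as in the example data, that `id1` and `id2` together identify a record; the helper name `difference_by_key` and its `keys` parameter are hypothetical. It does not add an `is_removed` field to the input dictionaries.

```python
def difference_by_key(a, b, keys=('id1', 'id2')):
    """Return the records of `a` whose key tuple does not occur in `b`."""
    seen = {tuple(rec[k] for k in keys) for rec in b}
    return [rec for rec in a if tuple(rec[k] for k in keys) not in seen]


a = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 1, 'id2': 'C'},
    {'id1': 2, 'id2': 'A'},
    {'id1': 2, 'id2': 'B'},
]
b = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 2, 'id2': 'A'},
]

result = difference_by_key(a, b)
print(result)  # [{'id1': 1, 'id2': 'C'}, {'id1': 2, 'id2': 'B'}]
```

The returned list is new, but its elements are the same dictionary objects as in `a`; copy them if you plan to modify the result.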
I tried to time my method against the accepted answer while using perf_counter. While testing, this method would run faster. The temp3 list outputs to 14571 items. If anyone would like to test this and leave feedback, please do so. Code below:
control_set = {item for item in temp1}
temp3 = [x for x in temp2 if x not in control_set]
I have two lists in Python:
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
Assuming the elements in each list are unique, I want to create a third list with items from the first list which are not in the second list:
temp3 = ['Three', 'Four']
Are there any fast ways without cycles and checking?
Try this:
temp3 = set(temp1) - set(temp2)
To get elements which are in temp1
but not in temp2
(assuming uniqueness of the elements in each list):
In [5]: list(set(temp1) - set(temp2))
Out[5]: ['Four', 'Three']
Beware that it is asymmetric :
In [5]: set([1, 2]) - set([2, 3])
Out[5]: set([1])
where you might expect/want it to equal set([1, 3])
. If you do want set([1, 3])
as your answer, you can use set([1, 2]).symmetric_difference(set([2, 3]))
.
You could use list comprehension:
temp3 = [item for item in temp1 if item not in temp2]
i’ll toss in since none of the present solutions yield a tuple:
temp3 = tuple(set(temp1) - set(temp2))
alternatively:
#edited using @Mark Byers idea. If you accept this one as answer, just accept his instead.
temp3 = tuple(x for x in temp1 if x not in set(temp2))
Like the other non-tuple yielding answers in this direction, it preserves order
The existing solutions all offer either one or the other of:
- Faster than O(n*m) performance.
- Preserve order of input list.
But so far no solution has both. If you want both, try this:
s = set(temp2)
temp3 = [x for x in temp1 if x not in s]
Performance test
import timeit
init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
print timeit.timeit('list(set(temp1) - set(temp2))', init, number = 100000)
print timeit.timeit('s = set(temp2);[x for x in temp1 if x not in s]', init, number = 100000)
print timeit.timeit('[item for item in temp1 if item not in temp2]', init, number = 100000)
Results:
4.34620224079 # ars' answer
4.2770634955 # This answer
30.7715615392 # matt b's answer
The method I presented as well as preserving order is also (slightly) faster than the set subtraction because it doesn’t require construction of an unnecessary set. The performance difference would be more noticable if the first list is considerably longer than the second and if hashing is expensive. Here’s a second test demonstrating this:
init = '''
temp1 = [str(i) for i in range(100000)]
temp2 = [str(i * 2) for i in range(50)]
'''
Results:
11.3836875916 # ars' answer
3.63890368748 # this answer (3 times faster!)
37.7445402279 # matt b's answer
this could be even faster than Mark’s list comprehension:
list(itertools.filterfalse(set(temp2).__contains__, temp1))
The difference between two lists (say list1 and list2) can be found using the following simple function.
def diff(list1, list2):
c = set(list1).union(set(list2)) # or c = set(list1) | set(list2)
d = set(list1).intersection(set(list2)) # or d = set(list1) & set(list2)
return list(c - d)
or
def diff(list1, list2):
return list(set(list1).symmetric_difference(set(list2))) # or return list(set(list1) ^ set(list2))
By Using the above function, the difference can be found using diff(temp2, temp1)
or diff(temp1, temp2)
. Both will give the result ['Four', 'Three']
. You don’t have to worry about the order of the list or which list is to be given first.
This is another solution:
def diff(a, b):
xa = [i for i in set(a) if i not in b]
xb = [i for i in set(b) if i not in a]
return xa + xb
You could use a naive method if the elements of the difflist are sorted and sets.
list1=[1,2,3,4,5]
list2=[1,2,3]
print list1[len(list2):]
or with native set methods:
subset=set(list1).difference(list2)
print subset
import timeit
init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
print "Naive solution: ", timeit.timeit('temp1[len(temp2):]', init, number = 100000)
print "Native set solution: ", timeit.timeit('set(temp1).difference(temp2)', init, number = 100000)
Naive solution: 0.0787101593292
Native set solution: 0.998837615564
single line version of arulmr solution
def diff(listA, listB):
return set(listA) - set(listB) | set(listB) -set(listA)
If you run into TypeError: unhashable type: 'list'
you need to turn lists or sets into tuples, e.g.
set(map(tuple, list_of_lists1)).symmetric_difference(set(map(tuple, list_of_lists2)))
In case you want the difference recursively, I have written a package for python:
https://github.com/seperman/deepdiff
Installation
Install from PyPi:
pip install deepdiff
Example usage
Importing
>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function # In case running on Python 2
Same object returns empty
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> print(DeepDiff(t1, t2))
{}
Type of an item has changed
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{ 'type_changes': { 'root[2]': { 'newtype': <class 'str'>,
'newvalue': '2',
'oldtype': <class 'int'>,
'oldvalue': 2}}}
Value of an item has changed
>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}
Item added and/or removed
>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff)
{'dic_item_added': ['root[5]', 'root[6]'],
'dic_item_removed': ['root[4]'],
'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}
String difference
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { 'root[2]': {'newvalue': 4, 'oldvalue': 2},
"root[4]['b']": { 'newvalue': 'world!',
'oldvalue': 'world'}}}
String difference 2
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!nGoodbye!n1n2nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"worldn1n2nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { "root[4]['b']": { 'diff': '--- n'
'+++ n'
'@@ -1,5 +1,4 @@n'
'-world!n'
'-Goodbye!n'
'+worldn'
' 1n'
' 2n'
' End',
'newvalue': 'worldn1n2nEnd',
'oldvalue': 'world!n'
'Goodbye!n'
'1n'
'2n'
'End'}}}
>>>
>>> print (ddiff['values_changed']["root[4]['b']"]["diff"])
---
+++
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
1
2
End
Type change
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"worldnnnEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'type_changes': { "root[4]['b']": { 'newtype': <class 'str'>,
'newvalue': 'worldnnnEnd',
'oldtype': <class 'list'>,
'oldvalue': [1, 2, 3]}}}
List difference
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3, 4]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{'iterable_item_removed': {"root[4]['b'][2]": 3, "root[4]['b'][3]": 4}}
List difference 2:
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': {"root[4]['b'][3]": 3},
'values_changed': { "root[4]['b'][1]": {'newvalue': 3, 'oldvalue': 2},
"root[4]['b'][2]": {'newvalue': 2, 'oldvalue': 3}}}
List difference ignoring order or duplicates: (with the same dictionaries as above)
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2, ignore_order=True)
>>> print (ddiff)
{}
List that contains dictionary:
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
'values_changed': {"root[4]['b'][2][1]": {'newvalue': 3, 'oldvalue': 1}}}
Sets:
>>> t1 = {1, 2, 8}
>>> t2 = {1, 2, 3, 5}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (DeepDiff(t1, t2))
{'set_item_added': ['root[3]', 'root[5]'], 'set_item_removed': ['root[8]']}
Named Tuples:
>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> pprint (DeepDiff(t1, t2))
{'values_changed': {'root.y': {'newvalue': 23, 'oldvalue': 22}}}
Custom objects:
>>> class ClassA(object):
... a = 1
... def __init__(self, b):
... self.b = b
...
>>> t1 = ClassA(1)
>>> t2 = ClassA(2)
>>>
>>> pprint(DeepDiff(t1, t2))
{'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}
Object attribute added:
>>> t2.c = "new attribute"
>>> pprint(DeepDiff(t1, t2))
{'attribute_added': ['root.c'],
'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}
If you are really looking into performance, then use numpy!
Here is the full notebook as a gist on github with comparison between list, numpy, and pandas.
https://gist.github.com/denfromufa/2821ff59b02e9482be15d27f2bbd4451
if you want something more like a changeset… could use Counter
from collections import Counter
def diff(a, b):
""" more verbose than needs to be, for clarity """
ca, cb = Counter(a), Counter(b)
to_add = cb - ca
to_remove = ca - cb
changes = Counter(to_add)
changes.subtract(to_remove)
return changes
lista = ['one', 'three', 'four', 'four', 'one']
listb = ['one', 'two', 'three']
In [127]: diff(lista, listb)
Out[127]: Counter({'two': 1, 'one': -1, 'four': -2})
# in order to go from lista to list b, you need to add a "two", remove a "one", and remove two "four"s
In [128]: diff(listb, lista)
Out[128]: Counter({'four': 2, 'one': 1, 'two': -1})
# in order to go from listb to lista, you must add two "four"s, add a "one", and remove a "two"
I wanted something that would take two lists and could do what diff
in bash
does. Since this question pops up first when you search for “python diff two lists” and is not very specific, I will post what I came up with.
Using SequenceMather
from difflib
you can compare two lists like diff
does. None of the other answers will tell you the position where the difference occurs, but this one does. Some answers give the difference in only one direction. Some reorder the elements. Some don’t handle duplicates. But this solution gives you a true difference between two lists:
a = 'A quick fox jumps the lazy dog'.split()
b = 'A quick brown mouse jumps over the dog'.split()
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, a, b).get_opcodes():
if tag == 'equal': print('both have', a[i:j])
if tag in ('delete', 'replace'): print(' 1st has', a[i:j])
if tag in ('insert', 'replace'): print(' 2nd has', b[k:l])
This outputs:
both have ['A', 'quick']
1st has ['fox']
2nd has ['brown', 'mouse']
both have ['jumps']
2nd has ['over']
both have ['the']
1st has ['lazy']
both have ['dog']
Of course, if your application makes the same assumptions the other answers make, you will benefit from them the most. But if you are looking for true diff functionality, then this is the way to go.
For example, none of the other answers could handle:
a = [1,2,3,4,5]
b = [5,4,3,2,1]
But this one does:
2nd has [5, 4, 3, 2]
both have [1]
1st has [2, 3, 4, 5]
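If the goal is output that looks like what the diff command prints, difflib also provides unified_diff, which accepts any two sequences of strings. This is a minimal sketch rather than part of the answer above; the labels 'a' and 'b' are arbitrary, and lineterm='' is passed because the items are words, not newline-terminated lines:

```python
from difflib import unified_diff

a = 'A quick fox jumps the lazy dog'.split()
b = 'A quick brown mouse jumps over the dog'.split()

# Each yielded string starts with ' ', '-' or '+' (plus the header lines)
diff_lines = list(unified_diff(a, b, fromfile='a', tofile='b', lineterm=''))
for line in diff_lines:
    print(line)
```

Unlike get_opcodes(), this only works when the items are strings, so it would not handle the [1,2,3,4,5] example without converting the items first.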
This can be done using Python's XOR operator (^) on sets.
- This will remove the duplicates in each list
- This will show difference of temp1 from temp2 and temp2 from temp1.
set(temp1) ^ set(temp2)
Here’s a Counter answer for the simplest case.
This is shorter than the one above that does two-way diffs because it only does exactly what the question asks: generate a list of what’s in the first list but not the second.
from collections import Counter
lst1 = ['One', 'Two', 'Three', 'Four']
lst2 = ['One', 'Two']
c1 = Counter(lst1)
c2 = Counter(lst2)
diff = list((c1 - c2).elements())
Alternatively, depending on your readability preferences, it makes for a decent one-liner:
diff = list((Counter(lst1) - Counter(lst2)).elements())
Output:
['Three', 'Four']
Note that you can remove the list(...) call if you are just iterating over it.
Because this solution uses counters, it handles quantities properly vs the many set-based answers. For example on this input:
lst1 = ['One', 'Two', 'Two', 'Two', 'Three', 'Three', 'Four']
lst2 = ['One', 'Two']
The output is:
['Two', 'Two', 'Three', 'Three', 'Four']
We can calculate the union minus the intersection of the lists:
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two', 'Five']
set(temp1+temp2)-(set(temp1)&set(temp2))
Out: set(['Four', 'Five', 'Three'])
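The union minus the intersection is the symmetric difference, so the expression above should agree with the ^ operator and the symmetric_difference method used in other answers. A small check, using the same sample lists:

```python
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two', 'Five']

s1, s2 = set(temp1), set(temp2)
union_minus_intersection = (s1 | s2) - (s1 & s2)

# Both spellings of the symmetric difference give the same set
print(union_minus_intersection == s1 ^ s2)                       # True
print(union_minus_intersection == s1.symmetric_difference(s2))   # True
print(sorted(union_minus_intersection))  # ['Five', 'Four', 'Three']
```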
This can be solved with one line.
The question is given two lists (temp1 and temp2) return their difference in a third list (temp3).
temp3 = list(set(temp1).difference(set(temp2)))
I am a little late to the game, but you can compare the performance of some of the code mentioned above with the script below. Two of the fastest contenders are:
list(set(x).symmetric_difference(set(y)))
list(set(x) ^ set(y))
I apologize for the elementary level of coding.
import time
import random
from itertools import filterfalse
# 1 - performance (time taken)
# 2 - correctness (answer - 1,4,5,6)
# set performance
performance = 1
numberoftests = 7
def answer(x, y, z):
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    if z == 0:
        start = time.perf_counter()
        lists = (str(list(set(x)-set(y))+list(set(y)-set(x))))
        times = ("1 = " + str(time.perf_counter() - start))
        return (lists,times)
    elif z == 1:
        start = time.perf_counter()
        lists = (str(list(set(x).symmetric_difference(set(y)))))
        times = ("2 = " + str(time.perf_counter() - start))
        return (lists,times)
    elif z == 2:
        start = time.perf_counter()
        lists = (str(list(set(x) ^ set(y))))
        times = ("3 = " + str(time.perf_counter() - start))
        return (lists,times)
    elif z == 3:
        start = time.perf_counter()
        # filterfalse is lazy; list() forces the work so that it is timed
        lists = (list(filterfalse(set(y).__contains__, x)))
        times = ("4 = " + str(time.perf_counter() - start))
        return (lists,times)
    elif z == 4:
        start = time.perf_counter()
        lists = (tuple(set(x) - set(y)))
        times = ("5 = " + str(time.perf_counter() - start))
        return (lists,times)
    elif z == 5:
        start = time.perf_counter()
        lists = ([tt for tt in x if tt not in y])
        times = ("6 = " + str(time.perf_counter() - start))
        return (lists,times)
    else:
        start = time.perf_counter()
        Xarray = [iDa for iDa in x if iDa not in y]
        Yarray = [iDb for iDb in y if iDb not in x]
        lists = (str(Xarray + Yarray))
        times = ("7 = " + str(time.perf_counter() - start))
        return (lists,times)
n = numberoftests
if performance == 2:
    a = [1,2,3,4,5]
    b = [3,2,6]
    for c in range(0,n):
        d = answer(a,b,c)
        print(d[0])
elif performance == 1:
    for tests in range(0,10):
        print("Test Number" + str(tests + 1))
        a = random.sample(range(1, 900000), 9999)
        b = random.sample(range(1, 900000), 9999)
        for c in range(0,n):
            #if c not in (1,4,5,6):
            d = answer(a,b,c)
            print(d[1])
The simplest way is to use set().difference(set()):
list_a = [1,2,3]
list_b = [2,3]
print(set(list_a).difference(set(list_b)))
The answer is {1} (shown as set([1]) in Python 2).
To print it as a list:
print(list(set(list_a).difference(set(list_b))))
Here are a few simple, order-preserving ways of diffing two lists of strings.
Code
An unusual approach using pathlib:
import pathlib
temp1 = ["One", "Two", "Three", "Four"]
temp2 = ["One", "Two"]
p = pathlib.Path(*temp1)
r = p.relative_to(*temp2)
list(r.parts)
# ['Three', 'Four']
This assumes both lists contain strings with equivalent beginnings. See the docs for more details. Note, it is not particularly fast compared to set operations.
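To make the "equivalent beginnings" assumption concrete, here is a small sketch of what happens when the second list is not a prefix of the first. The helper name strip_prefix is hypothetical, and PurePosixPath is used so the behaviour does not depend on the operating system. The prefix is passed as a single path object because passing several positional arguments to relative_to is deprecated in recent Python versions:

```python
import pathlib

def strip_prefix(items, prefix):
    """Hypothetical helper: return items with prefix removed, or None."""
    path = pathlib.PurePosixPath(*items)
    try:
        rest = path.relative_to(pathlib.PurePosixPath(*prefix))
    except ValueError:
        # Raised when `prefix` is not a leading part of `items`
        return None
    return list(rest.parts)

temp1 = ["One", "Two", "Three", "Four"]
print(strip_prefix(temp1, ["One", "Two"]))   # ['Three', 'Four']
print(strip_prefix(temp1, ["Two", "One"]))   # None
```

Strings containing a path separator would be split into several parts, so this only suits simple, separator-free strings.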
A straightforward implementation using itertools.zip_longest:
import itertools as it
[x for x, y in it.zip_longest(temp1, temp2) if x != y]
# ['Three', 'Four']
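The zip_longest version compares items position by position, so it shares the same assumption that the second list is a prefix of the first. The sketch below wraps it in a hypothetical function, positional_diff, and contrasts it with a membership test when the second list holds the same items in a different order:

```python
import itertools as it

def positional_diff(first, second):
    """The zip_longest comparison from above, wrapped in a function."""
    return [x for x, y in it.zip_longest(first, second) if x != y]

temp1 = ['One', 'Two', 'Three', 'Four']

print(positional_diff(temp1, ['One', 'Two']))  # ['Three', 'Four']
print(positional_diff(temp1, ['Two', 'One']))  # ['One', 'Two', 'Three', 'Four']

# A membership test does not depend on position
lookup = set(['Two', 'One'])
print([x for x in temp1 if x not in lookup])   # ['Three', 'Four']
```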
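Here is a simple way to distinguish two lists (whatever the contents are). Note that the sets module exists only in Python 2; in Python 3 the built-in set type offers the same symmetric_difference method. The result is shown below: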
>>> from sets import Set
>>>
>>> l1 = ['xvda', False, 'xvdbb', 12, 'xvdbc']
>>> l2 = ['xvda', 'xvdbb', 'xvdbc', 'xvdbd', None]
>>>
>>> Set(l1).symmetric_difference(Set(l2))
Set([False, 'xvdbd', None, 12])
Hope this helps.
Let’s say we have two lists
list1 = [1, 3, 5, 7, 9]
list2 = [1, 2, 3, 4, 5]
we can see from the above two lists that items 1, 3, 5 exist in list2 and items 7, 9 do not. On the other hand, items 1, 3, 5 exist in list1 and items 2, 4 do not.
What is the best solution to return a new list containing items 7, 9 and 2, 4?
All the answers above find the solution, but which one is the most efficient?
def difference(list1, list2):
    new_list = []
    for i in list1:
        if i not in list2:
            new_list.append(i)
    for j in list2:
        if j not in list1:
            new_list.append(j)
    return new_list
versus
def sym_diff(list1, list2):
    return list(set(list1).symmetric_difference(set(list2)))
Using timeit we can see the results
import timeit
t1 = timeit.Timer("difference(list1, list2)", "from __main__ import difference, list1, list2")
t2 = timeit.Timer("sym_diff(list1, list2)", "from __main__ import sym_diff, list1, list2")
# Timer.timeit() returns the total time in seconds for all runs
print('Using two for loops', t1.timeit(number=100000), 'seconds')
print('Using symmetric_difference', t2.timeit(number=100000), 'seconds')
returns
[7, 9, 2, 4]
Using two for loops 0.11572412995155901 seconds
Using symmetric_difference 0.11285737506113946 seconds
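With five-element lists the two timings are almost identical, because scanning such a short list is cheap. The `in` test on a list is a linear scan, so the gap should widen as the lists grow. The sketch below repeats the comparison on longer random lists; the sizes, the seed and the number of runs are arbitrary assumptions, and the timings will vary by machine:

```python
import random
import timeit

def difference(list1, list2):
    """Two loops with list membership tests (quadratic in list length)."""
    new_list = [i for i in list1 if i not in list2]
    new_list += [j for j in list2 if j not in list1]
    return new_list

def sym_diff(list1, list2):
    """Set-based symmetric difference (order is not preserved)."""
    return list(set(list1).symmetric_difference(set(list2)))

random.seed(0)  # arbitrary seed, only for repeatability
list1 = random.sample(range(100_000), 1_000)
list2 = random.sample(range(100_000), 1_000)

loop_time = timeit.timeit(lambda: difference(list1, list2), number=5)
set_time = timeit.timeit(lambda: sym_diff(list1, list2), number=5)
print('Using two for loops', loop_time, 'seconds')
print('Using symmetric_difference', set_time, 'seconds')
```

Both functions return the same items, but only the loop version keeps them in their original order.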
def diffList(list1, list2): # returns the difference between two lists.
    if len(list1) > len(list2):
        return (list(set(list1) - set(list2)))
    else:
        return (list(set(list2) - set(list1)))
e.g. if list1 = [10, 15, 20, 25, 30, 35, 40]
and list2 = [25, 40, 35]
then the returned list will be output = [10, 20, 30, 15]
I prefer converting to sets and then using the difference() method. The full code is:
temp1 = ['One', 'Two', 'Three', 'Four' ]
temp2 = ['One', 'Two']
set1 = set(temp1)
set2 = set(temp2)
set3 = set1.difference(set2)
temp3 = list(set3)
print(temp3)
Output:
>>>print(temp3)
['Three', 'Four']
It’s the easiest to understand, and moreover, if you work with large data in the future, converting it to sets will remove duplicates when duplicates are not required. Hope it helps 😉
You can loop through the first list and add every item that isn’t in the second list to a third list. E.g.:
temp3 = []
for i in temp1:
    if i not in temp2:
        temp3.append(i)
print(temp3)
I know this question got great answers already but I wish to add the following method using numpy.
import numpy as np
temp1 = ['One', 'Two', 'Three', 'Four']
temp2 = ['One', 'Two']
np.setdiff1d(temp1, temp2).tolist()
['Four', 'Three'] #Output
If you need to remove from list a all values that are present in list b:
def list_diff(a, b):
    r = []
    for i in a:
        if i not in b:
            r.append(i)
    return r
list_diff([1,2,2], [1])
Result: [2,2]
or
def list_diff(a, b):
    return [x for x in a if x not in b]
Here is a modified version of @SuperNova’s answer
def get_diff(a: list, b: list) -> list:
    return list(set(a) ^ set(b))
Following on @arkolec’s answer, here is a utility class for comparing lists, tuples and sets:
from difflib import SequenceMatcher
class ListDiffer:
    def __init__(self, left, right, strict=False):
        assert isinstance(left, (list, tuple, set)), "left must be list, tuple or set"
        assert isinstance(right, (list, tuple, set)), "right must be list, tuple or set"
        self.l = list(left) if isinstance(left, (tuple, set)) else left
        self.r = list(right) if isinstance(right, (tuple, set)) else right
        if strict:
            assert isinstance(left, right.__class__), \
                f'left type ({left.__class__.__name__}) must equal right type ({right.__class__.__name__})'
        self.diffs = []
        self.equal = []
        for tag, i, j, k, l in SequenceMatcher(None, self.l, self.r).get_opcodes():
            if tag in ['delete', 'replace', 'insert']:
                self.diffs.append((tag, i, j, k, l))
            elif tag == 'equal':
                # slice the list copy; a set passed as `left` cannot be sliced
                self.equal.extend(self.l[i:j])
    def has_diffs(self):
        return len(self.diffs) > 0
    def only_left(self):
        a = self.l[:]
        for v in self.equal:
            a.remove(v)
        return a
    def only_right(self):
        a = self.r[:]
        for v in self.equal:
            a.remove(v)
        return a
    def __str__(self, verbose=False):
        iD = 0
        sb = []
        if verbose:
            sb.append(f"left: {self.l}\n")
            sb.append(f"right: {self.r}\n")
            sb.append("diffs: ")
        for tag, i, j, k, l in self.diffs:
            s = f"({iD})"
            if iD > 0: sb.append(' | ')
            if tag in ('delete', 'replace'): s = f'{s} l:{self.l[i:j]}'
            if tag in ('insert', 'replace'): s = f'{s} r:{self.r[k:l]}'
            sb.append(s)
            iD = iD + 1
        if verbose:
            sb.append(f"\nequal: {self.equal}")
        return ''.join(sb)
    def __repr__(self) -> str:
        return "<ListDiffer> {}".format(self.__str__())
Usage:
left = ['a','b','c']
right = ['aa','b','c','d']
# right = ('aa','b','c','d')
ld = ListDiffer(left, right, strict=True)
print(f'ld.has_diffs(): {ld.has_diffs()}')
print(f'ld: {ld}')
print(f'ld.only_left(): {ld.only_left()}')
print(f'ld.only_right(): {ld.only_right()}')
Output:
ld.has_diffs(): True
ld: (0) l:['a'] r:['aa'] | (1) r:['d']
ld.only_left(): ['a']
ld.only_right(): ['aa', 'd']
I cannot speak to performance, but you could use ld.only_left() to get the result you are looking for.
If the lists contain objects rather than primitive types, this is one way of doing it.
The code is more explicit and returns a new list, although it marks the matched dictionaries in a with an is_removed key along the way.
This may not be an efficient implementation, but it is clean for smaller lists of objects.
a = [
{'id1': 1, 'id2': 'A'},
{'id1': 1, 'id2': 'B'},
{'id1': 1, 'id2': 'C'}, # out
{'id1': 2, 'id2': 'A'},
{'id1': 2, 'id2': 'B'}, # out
]
b = [
{'id1': 1, 'id2': 'A'},
{'id1': 1, 'id2': 'B'},
{'id1': 2, 'id2': 'A'},
]
def difference(a, b):
    for x in a:
        for y in b:
            if x['id1'] == y['id1'] and x['id2'] == y['id2']:
                x['is_removed'] = True
    c = [x for x in a if not x.get('is_removed', False)]
    return c
print(difference(a, b))
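For longer lists, the nested loop can be replaced by a set lookup on the identifying fields, which also avoids writing an is_removed flag into the input. This is only a sketch under the assumption that the key values are hashable; the function name difference_by_key and its keys parameter are made up for illustration:

```python
def difference_by_key(a, b, keys=('id1', 'id2')):
    """Return the dicts in `a` whose key fields match no dict in `b`."""
    # Build the lookup once: one tuple of key values per dict in b
    seen = {tuple(y[k] for k in keys) for y in b}
    return [x for x in a if tuple(x[k] for k in keys) not in seen]

a = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 1, 'id2': 'C'},  # out
    {'id1': 2, 'id2': 'A'},
    {'id1': 2, 'id2': 'B'},  # out
]
b = [
    {'id1': 1, 'id2': 'A'},
    {'id1': 1, 'id2': 'B'},
    {'id1': 2, 'id2': 'A'},
]

result = difference_by_key(a, b)
print(result)  # [{'id1': 1, 'id2': 'C'}, {'id1': 2, 'id2': 'B'}]
```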
I timed my method against the accepted answer using perf_counter, and in my tests this method ran faster. With my data, temp3 ends up with 14571 items. Note that, as written, it collects the items of temp2 that are not in temp1. If anyone would like to test this and leave feedback, please do so. Code below:
control_set = {item for item in temp1}
temp3 = [x for x in temp2 if x not in control_set]
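The timing code itself is not shown, so here is one way such a comparison could be set up with perf_counter. The sample data is hypothetical (the original lists are not given), and it assumes the answer being compared against is the set subtraction; actual timings will differ by machine and data:

```python
from time import perf_counter

# Hypothetical sample data; the answer's real lists are not shown
temp1 = list(range(0, 50_000, 2))
temp2 = list(range(0, 50_000, 3))

# Set lookup plus list comprehension (keeps the order of temp2)
start = perf_counter()
control_set = set(temp1)
temp3 = [x for x in temp2 if x not in control_set]
comprehension_time = perf_counter() - start

# Plain set subtraction (order is not preserved)
start = perf_counter()
by_subtraction = list(set(temp2) - set(temp1))
subtraction_time = perf_counter() - start

print(len(temp3), comprehension_time, subtraction_time)
```

A single perf_counter measurement is noisy; repeating the run, for example with timeit, gives a fairer comparison.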