How to efficiently get the mean of the elements in two lists of lists in Python
Question:
I have two lists as follows.
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
I want to get the mean of the common elements in the two lists as follows.
myoutput = [["chocolate", 0.5], ["egg", 0.45]]
My current code is as follows
import numpy as np

for item1 in mylist1:
    for item2 in mylist2:
        if item1[0] == item2[0]:
            print(np.mean([item1[1], item2[1]]))
However, since there are two nested for loops (O(n^2) complexity), this is very inefficient for very long lists. I am wondering if there is a more standard/efficient way of doing this in Python.
Answers:
You can do it in O(n) (a single pass over each list) by converting list 1 to a dict, then looking up each item of the 2nd list in that dict (an O(1) operation), like this:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
l1_as_dict = dict(mylist1)
myoutput = []
for item, price2 in mylist2:
    if item in l1_as_dict:
        price1 = l1_as_dict[item]
        myoutput.append([item, (price1 + price2) / 2])
print(myoutput)
Output:
[['chocolate', 0.5], ['egg', 0.45]]
To easily manipulate your values, I’d suggest converting each list to a dict, finding the common keys, and computing the mean:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
recipe_1 = dict(mylist1) # {'lemon': 0.1, 'egg': 0.1, 'muffin': 0.3, 'chocolate': 0.5}
recipe_2 = dict(mylist2) # {'chocolate': 0.5, 'milk': 0.2, 'carrot': 0.8, 'egg': 0.8}
common_keys = recipe_1.keys() & recipe_2.keys() # {'chocolate', 'egg'}
# with numpy:
myoutput = [[item, np.mean((recipe_1[item], recipe_2[item]))] for item in common_keys]
# or in pure Python:
myoutput = [[item, (recipe_1[item] + recipe_2[item]) / 2] for item in common_keys]
An O(n) solution that averages all items. Construct a dictionary mapping each key to a list of its values, then average that dictionary afterwards:
In []:
d = {}
for lst in (mylist1, mylist2):
    for i, v in lst:
        d.setdefault(i, []).append(v)  # alternatively, use collections.defaultdict
[(k, sum(v)/len(v)) for k, v in d.items()]
Out[]:
[('lemon', 0.1), ('egg', 0.45), ('muffin', 0.3), ('chocolate', 0.5), ('milk', 0.2), ('carrot', 0.8)]
Then if you just want the common ones you can add a guard:
In []:
[(k, sum(v)/len(v)) for k, v in d.items() if len(v) > 1]
Out[]:
[('egg', 0.45), ('chocolate', 0.5)]
This extends to any number of lists and makes no assumption around the number of common elements.
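For instance, the same loop absorbs a hypothetical third list unchanged (`mylist3` below is made up for illustration; the `len(v) == 3` guard assumes keys are unique within each list, as in the question's data):

```python
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
mylist3 = [["egg", 0.6], ["butter", 0.4]]  # hypothetical third list

d = {}
for lst in (mylist1, mylist2, mylist3):
    for k, v in lst:
        d.setdefault(k, []).append(v)

# keep only keys that appear in every one of the three lists
common = [(k, sum(v) / len(v)) for k, v in d.items() if len(v) == 3]
print(common)  # [('egg', 0.5)]
```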
Convert the lists to dicts:
d_list1 = dict(mylist1)
d_list2 = dict(mylist2)
[[k, (v+d_list2[k])/2] for k, v in d_list1.items() if k in d_list2]
#[['egg', 0.45], ['chocolate', 0.5]]
You can get the common keys of the two lists using the set intersection method, then calculate the mean with a list comprehension:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
dict1 = dict(mylist1)
dict2 = dict(mylist2)
res = [[key, (dict1.get(key)+dict2.get(key))/2] for key in set(dict1.keys()).intersection(set(dict2.keys()))]
print(res)
Output:
>> [['chocolate', 0.5], ['egg', 0.45]]
You can do it in the time required for computing the set intersection, which is O(min(N1, N2)) where N1, N2 are the list lengths.
intersect = {a[0] for a in mylist1}.intersection(a[0] for a in mylist2)
d1 = dict(mylist1)
d2 = dict(mylist2)
{i: (d1[i] + d2[i]) / 2 for i in intersect}
You can use the Pandas library to avoid writing any loops yourself, and the code stays concise and clean. Install Pandas with pip install pandas, then try this:
In [132]: import pandas as pd
In [109]: df1 = pd.DataFrame(mylist1)
In [110]: df2 = pd.DataFrame(mylist2)
In [117]: res = pd.merge(df1, df2, on=0)
In [121]: res['mean'] = res[['1_x', '1_y']].mean(axis=1)
In [125]: res.drop(columns=['1_x', '1_y'], inplace=True)
In [131]: res.values.tolist()
Out[131]: [['egg', 0.45], ['chocolate', 0.5]]
Edit
Pandas is fast because it uses NumPy under the hood, and NumPy implements highly efficient array operations. For more details on computing the mean in pure Python vs Pandas, check the post "Why is Pandas so madly fast?".
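If you want a feel for that NumPy speedup, here is a rough, machine-dependent sketch (the array size and repeat count are arbitrary choices for illustration):

```python
import timeit
from statistics import mean

import numpy as np

values = list(range(10_000))
arr = np.array(values)

# time 100 mean computations each way; absolute numbers vary by machine,
# but the NumPy version is typically orders of magnitude faster
t_py = timeit.timeit(lambda: mean(values), number=100)
t_np = timeit.timeit(lambda: arr.mean(), number=100)
print(f"statistics.mean: {t_py:.4f}s  numpy: {t_np:.4f}s")
```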
Here is one solution that uses collections.defaultdict
to group the items and calculates the averages with statistics.mean
:
from collections import defaultdict
from statistics import mean
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
d = defaultdict(list)
for lst in (mylist1, mylist2):
    for k, v in lst:
        d[k].append(v)
result = [[k, mean(v)] for k, v in d.items()]
print(result)
# [['lemon', 0.1], ['egg', 0.45], ['muffin', 0.3], ['chocolate', 0.5], ['milk', 0.2], ['carrot', 0.8]]
If we only want the common keys, just check whether a key collected more than one value:
result = [[k, mean(v)] for k, v in d.items() if len(v) > 1]
print(result)
# [['egg', 0.45], ['chocolate', 0.5]]
We could also just build the result from set intersection:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
d1, d2 = dict(mylist1), dict(mylist2)
result = [[k, (d1[k] + d2[k]) / 2] for k in d1.keys() & d2.keys()]
print(result)
# [['egg', 0.45], ['chocolate', 0.5]]
Here is a simple, very Pythonic solution:
result = [[x[0], (x[1] + y[1])/2] for x in mylist1 for y in mylist2 if x[0] == y[0]]
It is still O(n^2) like the OP’s nested loops, so it is not the fastest solution, but the list comprehension keeps it concise, and since neither this solution nor the OP’s handles repeated keys within a list, it replaces np.mean with a simple average of the two values.
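As a quick sanity check, the nested comprehension agrees with the dict-based answers, up to ordering (set intersection does not preserve order):

```python
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

# O(n^2) nested comprehension from this answer
quadratic = [[x[0], (x[1] + y[1]) / 2] for x in mylist1 for y in mylist2 if x[0] == y[0]]

# O(n) dict-based version from the earlier answers
d1, d2 = dict(mylist1), dict(mylist2)
linear = [[k, (d1[k] + d2[k]) / 2] for k in d1.keys() & d2.keys()]

# same pairs, possibly in a different order
print(sorted(quadratic) == sorted(linear))  # True
```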