Finding most common elements from large amount of lists

Question:

I have a lot of lists, and I want to find group of most common elements that appear in the lists.

For example:

l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

How to find group of elements that appear in these lists, for example:

Group 1: ['apple', 'banana', 'orange'] in l1, l2, appear 2 times
Group 2: ['apple', 'grape', 'kiwi'] in l4, l5, appear 2 times
Group 3: ['apple', 'grape', 'orange'] in l2, l5, appear 2 times

Index of the element is not important. Group should have minimum 3 and maximum 5 elements.
Lists can have from 3 to 10 elements.

I know that I can do something like this with intersections, but what if I have totally different list:

l7 = ["x", "y", "z", "k"]

Elements from this list are not appearing in any other list

Asked By: taga

||

Answers:

I think this might help, (I gave it a degree of freedom on the number of combinations to to extract and the minimum number of elements in each given combination :

import itertools

def get_max_rep(all_lists, min_comb_len, num_el_displayed):
    """Returns 'num_el_displayed' number of elements that repeat the most in the given lists"""
    # extract all the unique values of all the lists
    all_elements = set([el for l in all_lists for el in l])

    # build all the possible combinations starting from 'min_comb_len' number of elements
    combinations = [
        el for r in range(min(min_comb_len, len(all_elements)-1),len(all_elements)) 
        for el in itertools.combinations(all_elements, r)
        ]

    # count the number of repetitions of each combination in the given lists
    out = sorted(
        [(comb, sum([all(fruit in el  for fruit in comb) for el in all_lists])) 
        for comb in combinations], key=lambda x:x[1], reverse=True
        )[:num_el_displayed]

    return out

To test it out (here I want the first 5 combinations that have the most repetitions and that have a minimum of 2 elements:

# testing ...
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

all_lists = [l1, l2, l3, l4, l5, l6]

print(get_max_rep(all_lists, 2, 5))

output:

[(('grape', 'kiwi'), 3), (('grape', 'apple'), 3), (('orange', 'apple'), 3), (('banana', 'grape'), 2), (('banana', 'orange'), 2)]
Answered By: mrCopiCat

The problem you are facing is called Frequent Itemsets Mining. The 2 most popular algorithms (at least that I know) that solve this problem are:

They are very well explained in the book Data Mining. Concepts and Techniques (3rd Edition). A library that implements apriori is apyori, and this is how you could use it:

from apyori import apriori

transactions = [['apple', 'banana', 'orange'],
    ['apple', 'banana', 'grape', 'lemon', 'orange'],
    ['banana', 'grape', 'kiwi'],
    ['apple', 'grape', 'kiwi', 'peach'],
    ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
    ['chery', 'kiwi', 'pear']]

results = list(map(lambda x: list(x.items), 
        filter(lambda x: len(x.items) >= 3, 
            apriori(transactions, min_support=2/len(transactions), max_length=5)
        )
    ))

print(results)

It returns:

[['banana', 'apple', 'orange'], ['grape', 'kiwi', 'apple'], ['grape', 'apple', 'orange']]

About the arguments of the apriori call:

  • min_support is the minimum frequency a certain itemset must have. In your case, it is 2/len(transactions) because the itemset must be present in at least 2 of the transactions
  • max_length is the maximum length of an itemset
  • I had to filter the result because otherwise I would get also the itemsets whose length is less than 3. (It is weird that there is no argument for the min length too, maybe I did not find it but the help(apyori.apriori) does not mention anything like that).
Answered By: Marco Luzzara

You could do something like this using a powerset of the combinations, obviously excluding the ones that are not in the range 3-5 lists.

Note: For the powerset function i give credit to this other answer from martineu:

Using itertools.chain.from_iterable and itertools.combinations to get the combinations.

 itertools.chain.from_iterable(iterable)

Alternate constructor for chain(). Gets chained inputs from a single iterable argument that is evaluated lazily

and using functools.reduce to do some workaround for the intersections, I have come up with this code that think that might help:

functools.reduce(function, iterable[, initializer])

from the docs:

Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.

from itertools import chain, combinations
from functools import reduce
from pprint import pprint

lsts = [['apple', 'banana', 'orange'],
        ['apple', 'banana', 'grape', 'lemon', 'orange'],
        ['banana', 'grape', 'kiwi'],
        ['apple', 'grape', 'kiwi', 'peach'],
        ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
        ['chery', 'kiwi', 'pear']]

lsts = tuple(map(set, lsts))


def powerset(iterable, start=0, stop=None): #from answer https://stackoverflow.com/a/40986475/13285707
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)  # allows duplicate elements
    if stop is None:
        stop = len(s)+1
    return chain.from_iterable(combinations(s, r) for r in range(start, stop))


groups = []
for x in powerset(lsts, 3, 5):
    if 3 < len(c:=reduce(set.intersection,x)) < 5:
        groups.append(c)


pprint(groups)

Output:

[{'apple', 'orange'}, {'apple', 'grape'}, {'grape', 'kiwi'}]
Answered By: XxJames07-

My solutions:

l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

# return same elements of two lists
def sub_eq_lists(l1, l2):
    return list(set([x for x in l1 + l3 if x in l1 and x in  l2]))

all_lists = [l1, l2, l3, l4, l5, l6]
results = []
for j in all_lists:
    for l in all_lists:
        if(l!=j):
            same_elements = sub_eq_lists(j,l)
            if (len(same_elements) > 2) and (len(same_elements) < 11):# your conditions
   
                results.append(same_elements)
# count occurrences
results = {str(item): results.count(item) for item in results}
results

Output:

 {"['apple', 'orange', 'banana']": 2,
 "['apple', 'orange', 'grape']": 2,
 "['apple', 'kiwi', 'grape']": 2}
Answered By: Shahab Rahnama

You can use A.issubset(B) in order to check if elements of A are present in B or not, it will simply return False.


    def count1(grp=None, list_num=None):
        return grp.issubset(list_num)
        
    if __name__ == "__main__":
        g1 = {'apple', 'banana', 'orange'}
        g2 = {'apple', 'grape', 'kiwi'}
        g3 = {'apple', 'grape', 'orange'}
        
        l1 = ['apple', 'banana', 'orange']
        l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
        l3 = ['banana', 'grape', 'kiwi']
        l4 = ['apple', 'grape', 'kiwi', 'peach']
        l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
        l6 = ['chery', 'kiwi', 'pear']
        
        group_count = {}
        groups = [g1, g2, g3]
        list_check = [l1, l2, l3, l4, l5, l6]
        
        for index_list, list_num in enumerate(list_check):
            for index, grp in enumerate(groups):
                    if count1(grp, list_num):
                        group_count.setdefault(f'g{index + 1}',[]).append(f'l{index_list + 1}')
                        
        print(group_count)

Output :
This what i understood from question.

{'g1': ['l1', 'l2'], 'g3': ['l2', 'l5'], 'g2': ['l4', 'l5']}
Answered By: Nerdy Ayubi
Categories: questions Tags:
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.