Pandas: combination with highest coverage

Question:

In a supermarket I selected 30 products on which we want to run an analysis. I want to see which 12 of them give me the widest coverage of clients (on a specific date, no time involved).

This means that I have 30!/(12!(30-12)!) = 86,493,225 combinations of products.
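As a quick sanity check of that count, here is a minimal sketch using Python's standard library (`math.comb` requires Python 3.8 or later; the variable names are only illustrative):

```python
import math

n_products = 30   # candidate products selected for the analysis
subset_size = 12  # how many of them we want to keep

# Number of ways to choose 12 products out of 30, ignoring order.
n_combinations = math.comb(n_products, subset_size)
print(n_combinations)  # 86493225
```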

My pandas DataFrame of client purchases:

Client Product
A Banana
B Apple
B Banana
C Water

Now, I could generate all the combinations with itertools and then iterate over them to find the one with the highest client count:

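import itertools

# combinations() is lazy; wrapping it in set() would materialize all 86,493,225 tuples
comb = itertools.combinations([banana,...], 12)
d = {}
for i in comb:
    # the column is "Product"; df.product would resolve to the DataFrame.product method
    d[i] = df[df["Product"].isin(i)].Client.nunique()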

But this will take a spectacular amount of time.

Do you see a better way to compute this?

Please note that I do not want the combination of the 12 most common products, but the combination that yields the most clients. These may be the same, but not necessarily: 2 products that are not individually common may yield more clients if their 2 groups of buyers don't overlap much.

Any thoughts?

Thank you.

Asked By: lorenzo


Answers:

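This is the maximum coverage problem, which is NP-hard, so an exact answer essentially requires a search like the brute force in the question. A greedy algorithm provides an approximate solution: at each step, add the product that brings in the most new clients, i.e. clients not already covered by the products chosen so far. This greedy choice is known to reach at least a 1 - 1/e (about 63%) fraction of the optimal coverage.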
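A minimal sketch of that greedy step in pandas follows. It assumes the DataFrame has the `Client` and `Product` columns from the question and has already been filtered to the 30 candidate products. The function name `greedy_max_coverage` and its parameters are hypothetical, chosen only for illustration. It returns an approximate answer, not necessarily the optimal combination.

```python
import pandas as pd


def greedy_max_coverage(df, k, client_col="Client", product_col="Product"):
    """Greedily pick up to k products, each time adding the one with the most new clients."""
    # Map each product to the set of clients who bought it.
    clients_by_product = df.groupby(product_col)[client_col].apply(set).to_dict()
    covered = set()  # clients reached by the products chosen so far
    chosen = []      # products in the order they were picked
    for _ in range(k):
        # Candidate that adds the most not-yet-covered clients.
        best = max(
            (p for p in clients_by_product if p not in chosen),
            key=lambda p: len(clients_by_product[p] - covered),
            default=None,
        )
        if best is None or not (clients_by_product[best] - covered):
            break  # no remaining product adds a new client
        chosen.append(best)
        covered |= clients_by_product[best]
    return chosen, covered


# Toy data taken from the example table in the question.
df = pd.DataFrame(
    {
        "Client": ["A", "B", "B", "C"],
        "Product": ["Banana", "Apple", "Banana", "Water"],
    }
)
chosen, covered = greedy_max_coverage(df, k=2)
print(chosen, len(covered))  # ['Banana', 'Water'] 3
```

For the real data, calling it with `k=12` takes at most 12 passes over the 30 products instead of 86,493,225 DataFrame filters. The loop stops early if every client is already covered, so it can return fewer than `k` products.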

Answered By: Jeff