Pandas: combination with highest coverage

Question

In a supermarket I selected 30 products on which we want to run an analysis. I want to see which 12 of them give me the widest coverage of clients (on a specific date, no time involved).

This means that there are 30!/(12!(30-12)!) = 86493225 possible combinations of products.
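As a quick sanity check on that count, the binomial coefficient can be computed with the standard library (a minimal sketch; the variable names are only illustrative):

```python
import math

n_products = 30   # products selected for the analysis
n_selected = 12   # size of each combination

# Number of ways to choose 12 products out of 30, order ignored.
n_combinations = math.comb(n_products, n_selected)
print(n_combinations)  # 86493225
```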

My pandas DataFrame of client purchases:

Client	Product
A	Banana
B	Apple
B	Banana
C	Water
…

Now, I could generate all the combinations with itertools and iterate over them to find the one with the highest client count:

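import itertools

# combinations() already yields each combination once, so set() is not needed.
comb = itertools.combinations([banana,...], 12)
d = {}
for i in comb:
    # df.product is the DataFrame.product method, so select the column by name.
    d[i] = df[df["Product"].isin(i)].Client.nunique()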

But this will take a spectacular amount of time.

Do you see a better way to compute this?

Please note that I do not want the combination of the 12 most common products, but the combination that yields the most clients. These may be the same, but not necessarily: 2 products that are not individually common may yield more clients if their 2 groups of buyers don't overlap much.

Any thoughts?

Thank you.

Asked By: lorenzo

Answer 1

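This is the maximum coverage problem, which is NP-hard. A greedy algorithm provides an approximate solution: at each step, add the product that brings in the most new clients, i.e. clients not already covered by the products chosen so far. The greedy result is guaranteed to cover at least a 1 - 1/e (about 63%) fraction of the clients covered by the best possible combination.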
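A minimal sketch of that greedy step in pandas might look like the following. The function name `greedy_coverage` and its parameters are hypothetical, and the sample data contains only the four rows visible in the question, so treat it as an illustration rather than a tuned implementation.

```python
import pandas as pd


def greedy_coverage(df, k, client_col="Client", product_col="Product"):
    """Greedily pick up to k products (hypothetical helper, not a pandas API).

    Each step adds the product bought by the most clients who are not
    yet covered by the products chosen so far.
    """
    # Map each product to the set of clients who bought it.
    clients_by_product = {
        product: set(clients)
        for product, clients in df.groupby(product_col)[client_col]
    }
    chosen, covered = [], set()
    for _ in range(min(k, len(clients_by_product))):
        # Candidate that adds the largest number of new clients.
        best = max(
            clients_by_product,
            key=lambda p: len(clients_by_product[p] - covered),
        )
        if not clients_by_product[best] - covered:
            break  # no remaining product adds a new client
        chosen.append(best)
        covered |= clients_by_product.pop(best)
    return chosen, covered


# Sample data: only the rows shown in the question.
df = pd.DataFrame({
    "Client": ["A", "B", "B", "C"],
    "Product": ["Banana", "Apple", "Banana", "Water"],
})

chosen, covered = greedy_coverage(df, k=2)
print(chosen, len(covered))
```

With 30 products and k = 12, this evaluates at most 30 candidates per step instead of enumerating all 86493225 combinations. The result is an approximation, so it is not guaranteed to match the exhaustive search.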

Answered By: Jeff
