Pandas: combination with highest coverage
Question:
In a supermarket I selected 30 products on which we want to run an analysis. I want to see which 12 of them give me the widest coverage of clients (on a specific date, no time involved).
This means that I have 30!/(12!(30-12)!) = 86493225 combinations of products.
My pandas DataFrame of client purchases:
Client | Product |
---|---|
A | Banana |
B | Apple |
B | Banana |
C | Water |
… |
Now I could iterate over all the combinations to find the one with the highest client count, creating them all first with itertools:
import itertools
comb = set(itertools.combinations([banana,...], 12))
d = {}
for i in comb:
    # Bracket access is needed here: df.product is the DataFrame.product method, not the column
    d[i] = df[df["Product"].isin(i)].Client.nunique()
But this will take a spectacular amount of time.
Do you see any better way to compute this?
Please note that I do not want the combination of the 12 most common products, but the combination that yields the most clients. These may be the same, but not necessarily: two products that are not individually common may yield more clients if their two groups of buyers don't overlap much.
Any thoughts?
Thank you.
Answers:
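This is the maximum coverage problem, and it is NP-hard. A standard greedy algorithm gives an approximate solution: at each step, add the product that brings in the most new clients, meaning clients not already covered by the products chosen so far. The greedy result is guaranteed to cover at least a 1 - 1/e (about 63%) fraction of the clients covered by the optimal combination.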
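A minimal sketch of that greedy step in pandas might look like the following. The function name `greedy_max_coverage` and its parameters are made up for illustration, and the column names `Client` and `Product` are assumed from the sample table in the question. It returns an approximate answer, not necessarily the optimal combination.

```python
import pandas as pd


def greedy_max_coverage(df, k, client_col="Client", product_col="Product"):
    """Greedily pick up to k products, each adding the most uncovered clients.

    Hypothetical helper: column names are assumed from the question's table.
    """
    # Map each product to the set of clients who bought it.
    clients_by_product = df.groupby(product_col)[client_col].apply(set).to_dict()

    chosen, covered = [], set()
    for _ in range(min(k, len(clients_by_product))):
        # Product whose buyers add the most clients not yet covered.
        best = max(clients_by_product,
                   key=lambda p: len(clients_by_product[p] - covered))
        if not clients_by_product[best] - covered:
            break  # no remaining product adds a new client
        chosen.append(best)
        covered |= clients_by_product.pop(best)
    return chosen, covered


# Toy data matching the rows shown in the question.
df = pd.DataFrame({
    "Client": ["A", "B", "B", "C"],
    "Product": ["Banana", "Apple", "Banana", "Water"],
})
chosen, covered = greedy_max_coverage(df, k=2)
print(chosen, len(covered))
```

For the real data you would call it with `k=12` on the frame restricted to the 30 selected products. Each step scans at most 30 client sets, so this takes at most 12 passes over the products instead of roughly 86 million combinations.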