Finding what is most commonly purchased with a specific product – PANDAS

Question:

I am trying to find the most common products purchased with the product ‘SWE’ but am currently stuck. I have the variables ‘product’ and ‘sales_invoice’

So far I have this code:

df_pairs = df_pairs = df.groupby(df['product'].str.contains('SWE'))['sales_invoice']
print(df_pairs.head())

it results in a list of sales invoices containing SWE but does not list the remaining info of the sales invoice(specifically the other products listed on that sales invoice which is what I’m looking for)

Any ideas?

EDIT: it’s not printing all of the columns values – e.g. I want it to have sales invoice ’22’ for example which shows ‘SWE’ as specified, along with the other products it has on that invoice which are ‘EGG’ and ‘BIS’

example of df:

enter image description here

Asked By: Poppy

||

Answers:

First of all, since there may be more than one product in an order, we create a line for each product, then we group them according to sales_invoice and assign the products in each order to a list, then we show the distribution of each list as a dictionary. In addition, you can filter by SWE or see the product distribution regardless of the order (last example)

example data:

    Date        sales_invoice   product quantity    product_amount
0   02-01-2016  1               SWE     1           3.13
1   02-01-2016  2               SWE     1           3.13
2   02-01-2016  2               EGG     2           5.46
3   02-01-2016  2               EGG     1           5.46
4   02-01-2016  2               EGG     1           5.46
5   02-01-2016  3               0H2     1           5.58

#when quantity > 1, if product_amount=quantity * product_price use this.
#import numpy as np
#df['product_amount']=np.where(df['quantity'] > 1,df['product_amount'] / df['quantity'],df['product_amount'])

df = df.loc[df.index.repeat(df['quantity'])].drop(['quantity'],axis=1) #append a new row for every product
df_pairs= df.groupby('sales_invoice').agg({'sales_invoice':'count','product':list,'product_Amount':'sum'})

def get_dists(x):
    counts = dict()
    for i in x:
      counts[i] = counts.get(i, 0) + 1
      
    return dict(sorted(counts.items(), key=lambda item: item[1],reverse=True))
      
df_pairs['product_distribution']=df_pairs['product'].apply(lambda x: get_dists(x))
df_pairs=df_pairs.rename(columns={'sales_invoice':'product_count','product_Amount':'total_amount'})
print(df_pairs)
'''
sales_invoice   product_count   product                          total_amount product_distribution
1               1               ['SWE']                              3.13     {'SWE': 1}
2               5               ['SWE', 'EGG', 'EGG', 'EGG', 'EGG']  24.97    {'EGG': 4, 'SWE': 1}
3               1               ['0H2']                              5.58     {'0H2': 1}

'''

now, if you want to get only values that contain "SWE":

mask = df_pairs['product'].apply(lambda x: 'SWE' in x)
df_filtered = df_pairs[mask]

or if you need only distributions and get top n products:

df_filtered=pd.json_normalize(df_filtered['product_distribution']).T.fillna(0)
df_filtered['total']=df_filtered[0] + df_filtered[1]
df_filtered=df_filtered.sort_values(by='total',ascending=False)
#The total column will give you what you want.
Answered By: Clegane