pandas: resample a DataFrame to sum daily Sales separately for each CustomerID
Question:
I have a pandas DataFrame with a datetime column (TransactionDate), a CustomerID column, and a Sales column. I want to resample the data to daily frequency and sum Sales per day, separately for each CustomerID. I tried two approaches, but neither produces the desired result.
When I set only TransactionDate as the index, Sales is summed per day, but so is the CustomerID column, so I lose the information about which CustomerID generated how much in sales.
When I set both TransactionDate and CustomerID as the index, I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
How can I get a DataFrame of daily sales by CustomerID?
The complete code, including the data, is below:
import pandas as pd
import numpy as np
import random

random.seed(30)
np.random.seed(30)

# 500 invoices, one every 3 hours starting on 2015-01-01
InvoiceNo = range(10000, 10500)
print('len(InvoiceNo)', len(InvoiceNo))
start_date, end_date = '1/1/2015', '12/31/2019'
date_rng = pd.date_range(start=start_date, periods=len(InvoiceNo), freq='3h')
length_of_field = date_rng.shape[0]

df = pd.DataFrame(date_rng, columns=['TransactionDate'])
df['InvoiceNo'] = InvoiceNo
df['Quantity'] = np.random.randint(18, 100, size=len(date_rng))

# Pick a random item for each invoice and look up its price
Items = ('ItemA', 'ItemB', 'ItemC', 'ItemD')
group_1 = np.random.choice(Items, len(InvoiceNo), p=[0.3, 0.5, 0.15, 0.05])
Price = (10.0, 20, 30, 40)
dict_item_price = dict(zip(Items, Price))
PriceList = [dict_item_price[i] for i in group_1]

# Pick a random customer for each invoice
CustomerID = (18750, 18751, 18752, 18753, 18754, 18756, 18757)
group_2 = np.random.choice(CustomerID, len(InvoiceNo), p=[0.10, 0.25, 0.15, 0.05, 0.35, 0.05, 0.05])

df['ItemCode'] = group_1
df['Price'] = PriceList
df['CustomerID'] = group_2
df['CustomerID'].astype(str)  # result is not assigned, so CustomerID stays an integer column
df['Sales'] = df['Price'] * df['Quantity']
print('\ndf:')
print(df)
print(df.dtypes)

# First attempt: index by TransactionDate only, then resample daily
df1 = df[['CustomerID', 'Sales', 'TransactionDate']].copy().set_index('TransactionDate')
print('\n df1 :')
print(df1)
total_sales = df['Sales'].sum()
print('\ntotal sales :', total_sales)

# This sums every column per day, including CustomerID, which is the problem
daily_sales = df1.resample('D').sum()
print('\n daily_sales :')
print(daily_sales)
Answers:
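Group by CustomerID together with the calendar day of TransactionDate, then sum Sales. For example, using dt.normalize() to set the time of each timestamp to midnight: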
df.groupby(['CustomerID', df['TransactionDate'].dt.normalize()])['Sales'].sum()
Or, using dt.to_period('D') to convert each timestamp to a daily period:
df.groupby(['CustomerID', df['TransactionDate'].dt.to_period('D')])['Sales'].sum()
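Both expressions above return a Series indexed by (CustomerID, day). If a DataFrame is wanted, something along the following lines may help. This is only a sketch: the small frame below is made-up illustration data (not the data from the question), chosen so the sums are easy to check by hand, and the variable names by_day, daily_sales, daily_wide and filled are my own. It uses pd.Grouper to bin the datetime column by day, and also shows groupby followed by resample for comparison.

```python
import pandas as pd

# Hypothetical mini dataset: two customers, transactions spread over three days.
df = pd.DataFrame({
    "TransactionDate": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 03:00", "2015-01-01 06:00",
        "2015-01-02 09:00", "2015-01-03 12:00",
    ]),
    "CustomerID": [18750, 18751, 18750, 18751, 18750],
    "Sales": [100.0, 200.0, 50.0, 300.0, 25.0],
})

# Group on CustomerID as an ordinary key and on TransactionDate binned by day.
# Only (customer, day) pairs that actually occur end up in the result.
by_day = df.groupby(
    ["CustomerID", pd.Grouper(key="TransactionDate", freq="D")]
)["Sales"].sum()

# Long layout: one row per (CustomerID, day) with a Sales column.
daily_sales = by_day.reset_index()

# Wide layout: one row per day, one column per customer, 0 where there were no sales.
daily_wide = by_day.unstack("CustomerID", fill_value=0)

# Alternative: resample within each customer. This needs TransactionDate as the
# index and fills days without transactions (between a customer's first and
# last transaction) with 0.
filled = (
    df.set_index("TransactionDate")
      .groupby("CustomerID")["Sales"]
      .resample("D")
      .sum()
)

print(daily_sales)
print(daily_wide)
print(filled)
```

Which form to use depends on whether days without sales should appear: the pd.Grouper version lists only the days on which a customer bought something, while the groupby plus resample version also emits zero rows for the gaps inside each customer's own date range.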