How to count the total sales by year, month
Question:
I have a big CSV (17985 rows) with sales on different days. The CSV looks like this:
Customer Date Sale
Larry 1/2/2018 20$
Mike 4/3/2020 40$
John 12/5/2017 10$
Sara 3/2/2020 90$
Charles 9/8/2022 75$
Below is how many times that exact day appears in my csv (how many sales were made that day):
occur = df.groupby(['Date']).size()
occur
2018-01-02 32
2018-01-03 31
2018-01-04 42
2018-01-05 192
2018-01-06 26
I tried crosstab, groupby and several other methods, but the results either don't add up or are NaN.
new_df['total_sales_that_month'] = df.groupby('Date')['Sale'].sum()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
17980 NaN
17981 NaN
17982 NaN
17983 NaN
17984 NaN
I want to group them by year and month in a dataframe, based on total sales. Using dt.year and dt.month I managed to do this:
year
month
1 2020
1 2020
7 2019
8 2019
2 2018
... ...
4 2020
4 2020
4 2020
4 2020
4 2020
What I want to have is: month/year/total_sales_that_month. What method should I apply? This is the expected output:
Month Year Total_sale_that_month
1 2018 420$
2 2018 521$
3 2018 124$
4 2018 412$
5 2018 745$
Answers:
You can use groupby followed by sum, but first you have to strip the '$' from the Sale column and convert it to numeric:
import pandas as pd

# Clean your dataframe first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Sale'] = df['Sale'].str.strip('$').astype(float)

# Sum the sales for each (month, year) pair
out = (df.groupby([df['Date'].dt.month.rename('Month'),
                   df['Date'].dt.year.rename('Year')])
       ['Sale'].sum()
       .rename('Total_sale_that_month')
       # .astype(str).add('$') # uncomment if '$' matters
       .reset_index())
Output:
>>> out
Month Year Total_sale_that_month
0 2 2018 20.0
1 2 2020 90.0
2 3 2020 40.0
3 5 2017 10.0
4 8 2022 75.0
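As for the NaN values in the question: df.groupby('Date')['Sale'].sum() returns a Series indexed by date, while new_df presumably has the default 0..N-1 row index (an assumption, since new_df is not shown). Column assignment aligns on index labels, finds no matches, and fills everything with NaN. A minimal sketch of that effect, using the five sample rows plus one hypothetical extra row ("Anna") so that one month has two sales; if a per-row monthly total is what you want, transform keeps the original row index:

```python
import pandas as pd

df = pd.DataFrame({
    # "Anna" is a hypothetical row, added only so one month has two sales
    "Customer": ["Larry", "Mike", "John", "Sara", "Charles", "Anna"],
    "Date": ["1/2/2018", "4/3/2020", "12/5/2017", "3/2/2020", "9/8/2022", "15/2/2020"],
    "Sale": ["20$", "40$", "10$", "90$", "75$", "5$"],
})
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["Sale"] = df["Sale"].str.strip("$").astype(float)

# The grouped result is indexed by Date, not by the original row numbers.
per_day = df.groupby("Date")["Sale"].sum()

# Assigning it to a frame with a 0..N-1 index aligns on labels: no label matches.
misaligned = df.copy()
misaligned["total"] = per_day
print(misaligned["total"].isna().all())

# transform returns one value per original row, so the assignment lines up.
df["month_total"] = (
    df.groupby(df["Date"].dt.to_period("M"))["Sale"].transform("sum")
)
print(df[["Customer", "Date", "Sale", "month_total"]])
```

Grouping by dt.to_period("M") is just one way to get a single year-month key; the two-key groupby above works equally well for the summary table.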
Here is my code, using pivot_table, reset_index and sorting. Replace the column names with your own:
# Extract year and month from the date column
df["Dt_Customer_Y"] = pd.DatetimeIndex(df['Dt_Customer']).year
df["Dt_Customer_M"] = pd.DatetimeIndex(df['Dt_Customer']).month
# Sum Income per (year, month); the string 'sum' avoids passing the Python builtin
pvtt = df.pivot_table(index=['Dt_Customer_Y', 'Dt_Customer_M'], aggfunc={'Income': 'sum'})
pvtt.reset_index().sort_values(['Dt_Customer_Y', 'Dt_Customer_M'])
Dt_Customer_Y Dt_Customer_M Income
0 2012 1 856039.0
1 2012 2 487497.0
2 2012 3 921940.0
3 2012 4 881203.0
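Adapted to the columns from the question, the same pivot_table idea might look like the sketch below. It assumes the dates are day-first (as the other answer does) and reuses the five sample rows; the Year and Month helper columns are names chosen here for illustration. Putting Year first in the index gives chronological order, which is the ordering shown in the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Larry", "Mike", "John", "Sara", "Charles"],
    "Date": ["1/2/2018", "4/3/2020", "12/5/2017", "3/2/2020", "9/8/2022"],
    "Sale": ["20$", "40$", "10$", "90$", "75$"],
})

# Clean first: parse dates (assumed day-first) and make Sale numeric
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["Sale"] = df["Sale"].str.strip("$").astype(float)

# Helper columns to pivot on
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

monthly = (
    df.pivot_table(index=["Year", "Month"], values="Sale", aggfunc="sum")
      .rename(columns={"Sale": "Total_sale_that_month"})
      .reset_index()
      .sort_values(["Year", "Month"])
)
print(monthly[["Month", "Year", "Total_sale_that_month"]])
```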