Python Pandas – How can I expand month-over-month goals for markets and channels into a day-over-day one?
Question:
For a long time, I’ve maintained a report that shows progress to goals for various markets and channels. I’ve relied on some functions in Google Sheets (most notably split and flatten) to take a monthly budget and split it out into a daily goal, so that it can be combined with daily counts from another system and then aggregated in Tableau Desktop to whatever time period is needed (e.g. by week, month, or year). It’s finicky to add new markets, channels, etc., and the Google Sheets have gotten too big to connect with Tableau anyway. I wanted to use Python to make things easier.
The solution I’ve been working on uses the Pandas library of Python to pull in an Excel file that has a market, channel, and KPI in each row, and has a column for each month, with the actual goal as the values. I can get it to unpivot into a more tabular view with pd.melt, but I haven’t found any solutions that allow me to expand each month into days, where each day has a fraction of the goal proportionate to the number of days in the month, while preserving the KPI, market, and channel.
import pandas as pd

df = pd.DataFrame([['New', 'Albuquerque', 'Marketing', 34, 34, 34, 35, 35, 36, 36, 36, 37, 40, 40, 40],
                   ['New', 'Boston', 'Marketing', 12, 12, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17],
                   ['Converted', 'Albuquerque', 'Marketing', 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
                   ['Converted', 'Boston', 'Marketing', 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
                  columns=['KPI', 'Market', 'Channel',
                           '2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
                           '2022-05-01', '2022-06-01', '2022-07-01', '2022-08-01',
                           '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01'])
# Set up variables for the melt
index_vars = ['KPI','Market','Channel']
val_vars = df.set_index(index_vars).columns.tolist()
# Unpivot months
df = pd.melt(df,
             id_vars=index_vars,
             value_vars=val_vars,
             var_name='Date',
             value_name='Goal',
             ignore_index=False)
# Force dates to datetime, sort and reset index for a clean view
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.sort_values(by=['KPI','Market','Channel','Date']).reset_index(drop=True)
print(df)
This gives me a view like this:
KPI Market Channel Date Goal
0 Converted Albuquerque Marketing 2022-01-01 5
1 Converted Albuquerque Marketing 2022-02-01 5
2 Converted Albuquerque Marketing 2022-03-01 5
3 Converted Albuquerque Marketing 2022-04-01 5
4 Converted Albuquerque Marketing 2022-05-01 5
5 Converted Albuquerque Marketing 2022-06-01 5
6 Converted Albuquerque Marketing 2022-07-01 5
7 Converted Albuquerque Marketing 2022-08-01 5
8 Converted Albuquerque Marketing 2022-09-01 5
9 Converted Albuquerque Marketing 2022-10-01 5
10 Converted Albuquerque Marketing 2022-11-01 5
11 Converted Albuquerque Marketing 2022-12-01 5
12 Converted Boston Marketing 2022-01-01 2
13 Converted Boston Marketing 2022-02-01 2
...
I’m trying to take this and have it spit out something like this:
KPI Market Channel Date Goal
0 Converted Albuquerque Marketing 2022-01-01 0.1612903226
1 Converted Albuquerque Marketing 2022-01-02 0.1612903226
2 Converted Albuquerque Marketing 2022-01-03 0.1612903226
3 Converted Albuquerque Marketing 2022-01-04 0.1612903226
4 Converted Albuquerque Marketing 2022-01-05 0.1612903226
5 Converted Albuquerque Marketing 2022-01-06 0.1612903226
6 Converted Albuquerque Marketing 2022-01-07 0.1612903226
7 Converted Albuquerque Marketing 2022-01-08 0.1612903226
8 Converted Albuquerque Marketing 2022-01-09 0.1612903226
9 Converted Albuquerque Marketing 2022-01-10 0.1612903226
10 Converted Albuquerque Marketing 2022-01-11 0.1612903226
11 Converted Albuquerque Marketing 2022-01-12 0.1612903226
12 Converted Boston Marketing 2022-01-13 0.064516129
13 Converted Boston Marketing 2022-01-14 0.064516129
...
Edit: To expand on where I’m stuck, I’ve seen other solutions to get the quotient by dividing the goal by pd.Period.days_in_month, so I don’t think that will end up being a problem. The problem I’m facing is that the solutions I’ve seen for expanding the months into their constituent days have only shown the solution applied to a DataFrame with a single column of datetime data as the index with non-repeating values, whereas the DataFrame I’m looking to build would have the dates repeated for each KPI/Market/Channel combination.
When I try solutions like this one:
start = '2022-01-01'
end = '2022-12-31'
dates = pd.date_range(start,end,freq='D')
df_daily = df.reindex(dates,method='ffill')
df_daily
I get a TypeError that it "Cannot compare dtypes int64 and datetime64[ns]"
When I try to convert the Date column with .dt.to_period('m').dt.to_timestamp() right after melting, i.e.:
df['Date'] = (pd.to_datetime(df['Date'], format='%Y/%m/%d')
.dt.to_period('m')
.dt.to_timestamp())
I get an error that it "Cannot compare dtypes int64 and datetime64[ns]"
I’m not sure what the error is in my approach, but I feel like I’m missing something glaringly obvious.
Answers:
It can be easier to start from the original dataframe, before melt:
# df is the original wide DataFrame (before pd.melt)
# Keep only the date columns and move the others into the index
out = df.set_index(index_vars)
out.columns = pd.to_datetime(out.columns)
# Expand months to days
new_idx = pd.date_range(out.columns.min(), out.columns.max() + pd.offsets.MonthEnd(0), freq='D')
# Divide each monthly goal by the number of days in its month
out /= out.columns.days_in_month
# Reindex with the days index and fill missing values
out = out.reindex(new_idx, axis=1).ffill(axis=1)
The intermediate output is:
>>> out
2022-01-01 2022-01-02 2022-01-03 ... 2022-12-29 2022-12-30 2022-12-31
KPI Market Channel ...
New Albuquerque Marketing 1.096774 1.096774 1.096774 ... 1.290323 1.290323 1.290323
Boston Marketing 0.387097 0.387097 0.387097 ... 0.548387 0.548387 0.548387
Converted Albuquerque Marketing 0.161290 0.161290 0.161290 ... 0.161290 0.161290 0.161290
Boston Marketing 0.064516 0.064516 0.064516 ... 0.064516 0.064516 0.064516
[4 rows x 365 columns]
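As a quick sanity check on this wide daily frame, the daily values can be summed back up per month and compared with the original monthly goals. The sketch below repeats the steps above on a two-month, two-row sample; the names wide, monthly_goals, roundtrip and matches are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

index_vars = ["KPI", "Market", "Channel"]

# Hypothetical two-month sample in the same wide layout as the question.
wide = pd.DataFrame(
    [["New", "Albuquerque", "Marketing", 34, 34],
     ["Converted", "Boston", "Marketing", 2, 2]],
    columns=index_vars + ["2022-01-01", "2022-02-01"],
)

out = wide.set_index(index_vars)
out.columns = pd.to_datetime(out.columns)
monthly_goals = out.copy()  # keep the monthly values for the comparison

# Same steps as above: daily index, per-day goal, forward fill across columns.
new_idx = pd.date_range(out.columns.min(),
                        out.columns.max() + pd.offsets.MonthEnd(0),
                        freq="D")
out = (out / out.columns.days_in_month).reindex(new_idx, axis=1).ffill(axis=1)

# Sum the daily columns back up by month start and compare with the input.
roundtrip = out.T.resample("MS").sum().T
matches = np.allclose(roundtrip.to_numpy(), monthly_goals.to_numpy())
print(matches)
```

If matches is False, the usual suspects are a month missing from the columns or a date range that stops before the end of the last month.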
To get the long format shown in the question, stack the date columns:
out = out.rename_axis(columns='Date').stack().to_frame('Goal').reset_index()
Final output:
>>> out
KPI Market Channel Date Goal
0 New Albuquerque Marketing 2022-01-01 1.096774
1 New Albuquerque Marketing 2022-01-02 1.096774
2 New Albuquerque Marketing 2022-01-03 1.096774
3 New Albuquerque Marketing 2022-01-04 1.096774
4 New Albuquerque Marketing 2022-01-05 1.096774
... ... ... ... ... ...
1455 Converted Boston Marketing 2022-12-27 0.064516
1456 Converted Boston Marketing 2022-12-28 0.064516
1457 Converted Boston Marketing 2022-12-29 0.064516
1458 Converted Boston Marketing 2022-12-30 0.064516
1459 Converted Boston Marketing 2022-12-31 0.064516
[1460 rows x 5 columns]
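If the data has already been melted, a similar result can probably be reached without going back to the wide frame. The sketch below is an alternative, not the method above: it repeats each monthly row once per day of its month and then shifts the dates. It assumes a long frame with one row per KPI/Market/Channel/month and a datetime Date column holding the first of the month; long_df, daily and day_offset are illustrative names. It also avoids the reindex call from the question, which likely fails because the melted frame still has an integer index that cannot be compared with dates.

```python
import pandas as pd

# Hypothetical long-format sample shaped like the melted data in the question.
long_df = pd.DataFrame({
    "KPI": ["Converted", "Converted"],
    "Market": ["Albuquerque", "Boston"],
    "Channel": ["Marketing", "Marketing"],
    "Date": pd.to_datetime(["2022-01-01", "2022-02-01"]),
    "Goal": [5, 2],
})

# Number of days in each row's month (31 and 28 here).
days = long_df["Date"].dt.days_in_month

# Repeat each monthly row once per day of its month.
daily = long_df.loc[long_df.index.repeat(days)].copy()

# Position of each repeated row within its month: 0, 1, 2, ...
day_offset = daily.groupby(level=0).cumcount().to_numpy()

# Shift the first-of-month date by that many days.
daily["Date"] = daily["Date"] + pd.to_timedelta(day_offset, unit="D")

# Split the monthly goal evenly across the days of the month.
daily["Goal"] = daily["Goal"] / daily["Date"].dt.days_in_month

daily = daily.reset_index(drop=True)
print(daily.head())
```

Because every row is expanded independently, the dates repeat for each KPI/Market/Channel combination, which is the shape described in the question.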