Add a zero-filled row for each ID that is missing its Pre or Post period
Question:
I was given a dataframe containing the net amount each person spent on services in the pre period and in the post period. We want to compare whether these members had different spend and visits in the pre period than in the post period.
The dataframe looks like this, but the problem shows up in several places throughout the data: sometimes the "Pre" period is missing for a member and sometimes the "Post" period is.
import pandas as pd

df = pd.DataFrame({
    'unique_member_id_key': [723543, 723543, 723548, 723548, 723550, 723552, 723552],
    'net_amount': [34.26, 35.09, 72.07, 54.73, 54.32, 87.43, 87.32],
    'total_visits': [4, 2, 8, 1, 3, 5, 4],
    'Period': ["Pre", "Post", "Pre", "Post", "Pre", "Pre", "Post"],
})
What I want to do is fix this in Python so that, for each member missing a "Pre" or "Post" period, the pandas dataframe gets a new row with zeros in the "total_visits" and "net_amount" columns and the missing "Pre" or "Post" value in the Period column.
Is there a way to systematically do this without having to find each ID that is missing a "Pre" or "Post" period and inserting the row individually for each time this occurs?
Thanks!!
Mark
Answers:
IIUC, you can use pivot_table to get the dense matrix, then stack to get back to the shape of your original dataframe:
>>> (df.pivot_table(index='unique_member_id_key', columns='Period',
...                 values=['net_amount', 'total_visits'], fill_value=0)
...    .stack().reset_index())
   unique_member_id_key Period  net_amount  total_visits
0                723543   Post       35.09             2
1                723543    Pre       34.26             4
2                723548   Post       54.73             1
3                723548    Pre       72.07             8
4                723550   Post        0.00             0  # <- HERE
5                723550    Pre       54.32             3
6                723552   Post       87.32             4
7                723552    Pre       87.43             5
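One caveat worth checking before relying on this: pivot_table aggregates with the mean by default, so if a member ever had two rows for the same period they would be averaged into one. A small sketch of a check for that, using the sample data from the question (the name `dupes` is just for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'unique_member_id_key': [723543, 723543, 723548, 723548, 723550, 723552, 723552],
    'net_amount': [34.26, 35.09, 72.07, 54.73, 54.32, 87.43, 87.32],
    'total_visits': [4, 2, 8, 1, 3, 5, 4],
    'Period': ["Pre", "Post", "Pre", "Post", "Pre", "Pre", "Post"],
})

# Flag every row whose (member, Period) pair occurs more than once.
dupes = df.duplicated(subset=['unique_member_id_key', 'Period'], keep=False)

# Show the offending rows, if any; the sample data has none.
print(df[dupes])
```

If this prints any rows, decide how they should be combined (for example by passing aggfunc='sum' to pivot_table) before filling in the missing periods.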
Or, as suggested by @mozway, with set_index/unstack then stack/reset_index:
>>> (df.set_index(['unique_member_id_key', 'Period'])
...    .unstack(fill_value=0)
...    .stack().reset_index())
   unique_member_id_key Period  net_amount  total_visits
0                723543   Post       35.09             2
1                723543    Pre       34.26             4
2                723548   Post       54.73             1
3                723548    Pre       72.07             8
4                723550   Post        0.00             0  # <- HERE
5                723550    Pre       54.32             3
6                723552   Post       87.32             4
7                723552    Pre       87.43             5
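If you would rather spell out the expected periods yourself (for instance to get "Pre" before "Post", or because one period might be absent from the data entirely), another option is to build the full set of (member, Period) pairs and reindex onto it. This is only a sketch, assuming each (member, Period) pair appears at most once; `full_index` and `filled` are names made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'unique_member_id_key': [723543, 723543, 723548, 723548, 723550, 723552, 723552],
    'net_amount': [34.26, 35.09, 72.07, 54.73, 54.32, 87.43, 87.32],
    'total_visits': [4, 2, 8, 1, 3, 5, 4],
    'Period': ["Pre", "Post", "Pre", "Post", "Pre", "Pre", "Post"],
})

# Every (member, period) pair that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df['unique_member_id_key'].unique(), ['Pre', 'Post']],
    names=['unique_member_id_key', 'Period'],
)

# Reindex onto the full grid; pairs missing from df become zero-filled rows.
filled = (
    df.set_index(['unique_member_id_key', 'Period'])
      .reindex(full_index, fill_value=0)
      .reset_index()
)

print(filled)
```

Unlike the unstack/stack round trip, this keeps the rows in the order given to from_product, so each member's "Pre" row comes before its "Post" row.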