Pandas groupby & transpose dataframe while keeping original columns
Question:
I have a dataframe:
df =
ID WorkAddress City Lat Long Department
1 0001 123_lane City1 17.4 78.3 Audit
2 0002 123_lane City1 17.4 78.3 Lending
3 0003 111_lane City2 19.6 64.2 Finance
4 0004 112_lane City3 18.4 89.9 Legal
5 0005 112_lane City3 18.4 89.9 Legal
I transformed it to get a count of each ID by distinct WorkAddress, for each Department:
dfDeptCounts = (
    df.assign(flag=df.groupby('WorkAddress').Department.cumcount())
      .pivot_table(index='WorkAddress', columns=['Department'], values='ID', aggfunc='count')
      .reset_index()
)
dfDeptCounts =
WorkAddress Audit Lending Finance Legal
1 123_lane 1 1 0 0
2 111_lane 0 0 1 0
3 112_lane 0 0 0 2
Any attempt I make to include City, Lat and Long results in an error, whether I add them as additional groupby keys or try to reset the index. Is there a multi-indexing level that I’m missing, or is there a better way to transform the df so that it includes all columns?
Edit
I apologize, I might not have been clear in my question. This is the end goal:
dfDeptCounts =
WorkAddress City Lat Long Audit Lending Finance Legal
1 123_lane City1 17.4 78.3 1 1 0 0
2 111_lane City2 19.6 64.2 0 0 1 0
3 112_lane City3 18.4 89.9 0 0 0 2
Answers:
To go a bit beyond the answer @Psidom gave as a comment: you can use pandas.crosstab in combination with categorical data:
df['Department'] = pd.Categorical(
    df['Department'],
    categories=['Audit', 'Lending', 'Finance', 'HR', 'Legal'],
)
df2 = pd.crosstab(df.WorkAddress, df.Department, dropna=False)
Using categorical data ensures that even categories with no rows (here "HR") are represented in the final crosstab. For this to work you also need to pass the dropna=False parameter.
output:
>>> df2
Department Audit Lending Finance HR Legal
WorkAddress
111_lane 0 0 1 0 0
112_lane 0 0 0 0 2
123_lane 1 1 0 0 0
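For anyone who wants to run this step in isolation, here is a minimal self-contained sketch. It rebuilds the sample frame from the question; storing ID as strings (to keep the leading zeros) and the variable names departments and counts are my own assumptions, not part of the original post.

```python
import pandas as pd

# Sample data copied from the question (ID as strings is an assumption).
df = pd.DataFrame({
    "ID": ["0001", "0002", "0003", "0004", "0005"],
    "WorkAddress": ["123_lane", "123_lane", "111_lane", "112_lane", "112_lane"],
    "City": ["City1", "City1", "City2", "City3", "City3"],
    "Lat": [17.4, 17.4, 19.6, 18.4, 18.4],
    "Long": [78.3, 78.3, 64.2, 89.9, 89.9],
    "Department": ["Audit", "Lending", "Finance", "Legal", "Legal"],
})

# Declare every department we want as a column, including ones with no rows.
departments = ["Audit", "Lending", "Finance", "HR", "Legal"]
df["Department"] = pd.Categorical(df["Department"], categories=departments)

# dropna=False keeps the unobserved "HR" category as an all-zero column.
counts = pd.crosstab(df["WorkAddress"], df["Department"], dropna=False)
print(counts)
```

Without the categorical conversion, crosstab only knows about the departments that actually occur in the data, so "HR" would not appear at all.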
Now, if you want to add the other information, you first need to choose which row to keep for each WorkAddress (here it does not matter, as the information is identical within an address, so we keep the first one), and then merge those rows with the previous output:
(df.drop_duplicates(subset=['WorkAddress'])
   .drop('ID', axis=1)
   .merge(df2,
          left_on='WorkAddress',
          right_index=True)
)
output:
WorkAddress City Lat Long Department Audit Lending Finance HR Legal
1 123_lane City1 17.4 78.3 Audit 1 1 0 0 0
3 111_lane City2 19.6 64.2 Finance 0 0 1 0 0
4 112_lane City3 18.4 89.9 Legal 0 0 0 0 2
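Note that the merged output above still carries the original Department column (from the first row kept per address), which the desired result in the question does not have. One possible way to get closer to that end goal, sketched here under the assumption that City, Lat and Long are constant within each WorkAddress, is to drop Department together with ID before merging. The names locations and result are hypothetical, and this sketch uses plain strings rather than categoricals, so only departments present in the data become columns.

```python
import pandas as pd

# Sample data copied from the question (ID as strings is an assumption).
df = pd.DataFrame({
    "ID": ["0001", "0002", "0003", "0004", "0005"],
    "WorkAddress": ["123_lane", "123_lane", "111_lane", "112_lane", "112_lane"],
    "City": ["City1", "City1", "City2", "City3", "City3"],
    "Lat": [17.4, 17.4, 19.6, 18.4, 18.4],
    "Long": [78.3, 78.3, 64.2, 89.9, 89.9],
    "Department": ["Audit", "Lending", "Finance", "Legal", "Legal"],
})

# Number of rows per (WorkAddress, Department) pair.
counts = pd.crosstab(df["WorkAddress"], df["Department"])

# One row of location attributes per address; the per-row columns are dropped.
locations = (
    df.drop(columns=["ID", "Department"])
      .drop_duplicates(subset="WorkAddress")
)

# Attach the counts to the location attributes and renumber the rows.
result = (
    locations.merge(counts, left_on="WorkAddress", right_index=True)
             .reset_index(drop=True)
)
print(result)
```

If the location columns can differ within one address, drop_duplicates silently keeps the first row it sees, so it is worth checking that assumption on real data.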
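Use pivot_table with aggfunc='sum' on a helper column of ones, putting every column you want to keep into the index:
(df.assign(col1=1)
   .pivot_table(index=['WorkAddress', 'City', 'Lat', 'Long'],
                columns='Department',
                values='col1',
                aggfunc='sum',
                fill_value=0)
   .reset_index()
   .rename_axis(None, axis=1)
)
output: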
WorkAddress City Lat Long Audit Finance Legal Lending
0 111_lane City2 19.6 64.2 0 1 0 0
1 112_lane City3 18.4 89.9 0 0 2 0
2 123_lane City1 17.4 78.3 1 0 0 1
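The pivot_table output above lists the departments alphabetically, while the desired result in the question uses the order Audit, Lending, Finance, Legal. A small sketch of one way to control that order is to reindex the columns before resetting the index. The helper column name n, the list wanted and the name wide are hypothetical. As far as I know, rows with a missing value in any of the index columns are dropped by pivot_table by default, so this assumes WorkAddress, City, Lat and Long are fully populated.

```python
import pandas as pd

# Sample data copied from the question (ID as strings is an assumption).
df = pd.DataFrame({
    "ID": ["0001", "0002", "0003", "0004", "0005"],
    "WorkAddress": ["123_lane", "123_lane", "111_lane", "112_lane", "112_lane"],
    "City": ["City1", "City1", "City2", "City3", "City3"],
    "Lat": [17.4, 17.4, 19.6, 18.4, 18.4],
    "Long": [78.3, 78.3, 64.2, 89.9, 89.9],
    "Department": ["Audit", "Lending", "Finance", "Legal", "Legal"],
})

# Column order taken from the desired output in the question.
wanted = ["Audit", "Lending", "Finance", "Legal"]

wide = (
    df.assign(n=1)  # helper column of ones to sum up
      .pivot_table(
          index=["WorkAddress", "City", "Lat", "Long"],
          columns="Department",
          values="n",
          aggfunc="sum",
          fill_value=0,
      )
      .reindex(columns=wanted, fill_value=0)  # fix the department order
      .reset_index()
      .rename_axis(None, axis=1)
)
print(wide)
```

Adding a department to wanted that never occurs in the data yields an all-zero column, which gives a similar effect to the categorical approach in the first answer.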