Pandas – GroupBy and then Merge on original table
Question:
I’m trying to write a function to aggregate and perform various stats calcuations on a dataframe in Pandas and then merge it to the original dataframe however, I’m running to issues. This is code equivalent in SQL:
SELECT EID,
PCODE,
SUM(PVALUE) AS PVALUE,
SUM(SQRT(SC*EXP(SC-1))) AS SC,
SUM(SI) AS SI,
SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp INNER JOIN
foo_bar ON foo_bar.EID = foo_bar_grp.EID
AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps: Loading the data
IN:>>
pol_dict = {'PID':[1,1,2,2],
'EID':[123,123,123,123],
'PCODE':['GU','GR','GU','GR'],
'PVALUE':[100,50,150,300],
'SI':[400,40,140,140],
'SC':[230,23,213,213],
'EE':[10000,10000,2000,30000],
}
pol_df = DataFrame(pol_dict)
pol_df
OUT:>>
EID EE PCODE PID PVALUE SC SI
0 123 10000 GU 1 100 230 400
1 123 10000 GR 1 50 23 40
2 123 2000 GU 2 150 213 140
3 123 30000 GR 2 300 213 140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])
#generate acc level aggregate
acc_df = po_grouped_df.agg({
'PVALUE' : np.sum,
'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
'SC' : np.sum,
'EE' : np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, po_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
OUT:>>
KeyError: u’no item named EID’
For some reason, the grouped dataframe can’t join back to the original table. I’ve looked at ways of trying to convert the groupby columns to actual columns but that doesn’t seem to work.
Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)
Thanks!
Answers:
By default, groupby
output has the grouping columns as indicies, not columns, which is why the merge is failing.
There are a couple different ways to handle it, probably the easiest is using the as_index
parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform
works series by series, so you wouldn’t be able to perform multiple functions on multiple columns as you’ve done with agg
, but transform
does allow you to skip merge
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
PID EID PCODE PVALUE SI SC EE sum_pval func_si sum_sc sum_ee
1 123 GU 100 400 230 10000 250 8.765549e+87 443 12000
1 123 GR 50 40 23 10000 350 1.805222e+31 236 40000
2 123 GU 150 140 213 2000 250 8.765549e+87 443 12000
2 123 GR 300 140 213 30000 350 1.805222e+31 236 40000
For more info, check out this SO answer.
I’m trying to write a function to aggregate and perform various stats calcuations on a dataframe in Pandas and then merge it to the original dataframe however, I’m running to issues. This is code equivalent in SQL:
SELECT EID,
PCODE,
SUM(PVALUE) AS PVALUE,
SUM(SQRT(SC*EXP(SC-1))) AS SC,
SUM(SI) AS SI,
SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp INNER JOIN
foo_bar ON foo_bar.EID = foo_bar_grp.EID
AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps: Loading the data
IN:>>
pol_dict = {'PID':[1,1,2,2],
'EID':[123,123,123,123],
'PCODE':['GU','GR','GU','GR'],
'PVALUE':[100,50,150,300],
'SI':[400,40,140,140],
'SC':[230,23,213,213],
'EE':[10000,10000,2000,30000],
}
pol_df = DataFrame(pol_dict)
pol_df
OUT:>>
EID EE PCODE PID PVALUE SC SI
0 123 10000 GU 1 100 230 400
1 123 10000 GR 1 50 23 40
2 123 2000 GU 2 150 213 140
3 123 30000 GR 2 300 213 140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])
#generate acc level aggregate
acc_df = po_grouped_df.agg({
'PVALUE' : np.sum,
'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
'SC' : np.sum,
'EE' : np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, po_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
OUT:>>
KeyError: u’no item named EID’
For some reason, the grouped dataframe can’t join back to the original table. I’ve looked at ways of trying to convert the groupby columns to actual columns but that doesn’t seem to work.
Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)
Thanks!
By default, groupby
output has the grouping columns as indicies, not columns, which is why the merge is failing.
There are a couple different ways to handle it, probably the easiest is using the as_index
parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform
works series by series, so you wouldn’t be able to perform multiple functions on multiple columns as you’ve done with agg
, but transform
does allow you to skip merge
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
PID EID PCODE PVALUE SI SC EE sum_pval func_si sum_sc sum_ee
1 123 GU 100 400 230 10000 250 8.765549e+87 443 12000
1 123 GR 50 40 23 10000 350 1.805222e+31 236 40000
2 123 GU 150 140 213 2000 250 8.765549e+87 443 12000
2 123 GR 300 140 213 30000 350 1.805222e+31 236 40000
For more info, check out this SO answer.