Get group id back into pandas dataframe
Question:
For dataframe
In [2]: df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
   ...:                    'Rank': np.random.randint(0,3,6),
   ...:                    'Val': np.random.rand(6)})
   ...: df
Out[2]:
  Name  Rank       Val
0  foo     0  0.299397
1  bar     0  0.909228
2  foo     0  0.517700
3  bar     0  0.929863
4  foo     1  0.209324
5  bar     2  0.381515
I’m interested in grouping by Name and Rank and possibly getting aggregate values
In [3]: group = df.groupby(['Name', 'Rank'])
In [4]: agg = group.agg(sum)
In [5]: agg
Out[5]:
                Val
Name Rank
bar  0     1.839091
     2     0.381515
foo  0     0.817097
     1     0.209324
But I would like to get a field in the original df that contains the group number for that row, like
In [13]: df['Group_id'] = [2, 0, 2, 0, 3, 1]
In [14]: df
Out[14]:
  Name  Rank       Val  Group_id
0  foo     0  0.299397         2
1  bar     0  0.909228         0
2  foo     0  0.517700         2
3  bar     0  0.929863         0
4  foo     1  0.209324         3
5  bar     2  0.381515         1
Is there a good way to do this in pandas?
I can get it with Python,
In [16]: from itertools import count
In [17]: c = count()
In [22]: group.transform(lambda x: next(c))
Out[22]:
   Val
0    2
1    0
2    2
3    0
4    3
5    1
but it’s pretty slow on a large dataframe, so I figured there may be a better built-in pandas way to do this.
Answers:
A lot of handy things are stored in the DataFrameGroupBy.grouper
object. For example:
>>> df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...                    'Rank': np.random.randint(0,3,6),
...                    'Val': np.random.rand(6)})
>>> grouped = df.groupby(["Name", "Rank"])
>>> grouped.grouper.
grouped.grouper.agg_series        grouped.grouper.indices
grouped.grouper.aggregate         grouped.grouper.labels
grouped.grouper.apply             grouped.grouper.levels
grouped.grouper.axis              grouped.grouper.names
grouped.grouper.compressed        grouped.grouper.ngroups
grouped.grouper.get_group_levels  grouped.grouper.nkeys
grouped.grouper.get_iterator      grouped.grouper.result_index
grouped.grouper.group_info        grouped.grouper.shape
grouped.grouper.group_keys        grouped.grouper.size
grouped.grouper.groupings         grouped.grouper.sort
grouped.grouper.groups
and so:
>>> df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.group_info[0]
>>> df
  Name  Rank       Val  GroupId
0  foo     0  0.302482        2
1  bar     0  0.375193        0
2  foo     2  0.965763        4
3  bar     2  0.166417        1
4  foo     1  0.495124        3
5  bar     2  0.728776        1
There may be a nicer alias for grouper.group_info[0] lurking around somewhere, but this should work, anyway.
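To make it clearer what these ids mean, here is a minimal sketch that rebuilds the same numbering with public pandas calls only: each row's id is the position of its (Name, Rank) key among the sorted unique keys. The fixed Rank and Val values are taken from the question's example so the result is reproducible; the names key_cols, keys and result are illustrative, not part of pandas.

```python
import pandas as pd

# Fixed data from the question's example (no random values, so the ids are stable).
df = pd.DataFrame({
    "Name": ["foo", "bar"] * 3,
    "Rank": [0, 0, 0, 0, 1, 2],
    "Val": [0.299397, 0.909228, 0.517700, 0.929863, 0.209324, 0.381515],
})

key_cols = ["Name", "Rank"]  # illustrative name for the grouping columns

# One row per distinct key, in sorted key order, numbered 0..n-1.
keys = (
    df[key_cols]
    .drop_duplicates()
    .sort_values(key_cols)
    .reset_index(drop=True)
)
keys["GroupId"] = range(len(keys))

# A left merge keeps the row order of df and attaches each row's group id.
result = df.merge(keys, on=key_cols, how="left")
print(result)
```

This is slower than reading the ids off the groupby object, but it shows that the numbering depends only on the sorted key values.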
The correct solution is to use grouper.label_info:
df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.label_info
It automatically associates each row in the df dataframe with the corresponding group label.
Use GroupBy.ngroup, available from pandas 0.20.2+:
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()
print(df)
  Name  Rank       Val  GroupId
0  foo     2  0.451724        4
1  bar     0  0.944676        0
2  foo     0  0.822390        2
3  bar     2  0.063603        1
4  foo     1  0.938892        3
5  bar     2  0.332454        1
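As a reproducible check, the sketch below runs ngroup on the fixed values from the question instead of random data. It also shows the effect of the groupby sort flag: with the default sort=True the groups are numbered in sorted key order, while sort=False numbers them in order of first appearance. The column name GroupIdUnsorted is only illustrative.

```python
import pandas as pd

# Fixed data from the question's example, so the output does not change between runs.
df = pd.DataFrame({
    "Name": ["foo", "bar"] * 3,
    "Rank": [0, 0, 0, 0, 1, 2],
    "Val": [0.299397, 0.909228, 0.517700, 0.929863, 0.209324, 0.381515],
})

# Default sort=True: ids follow the sorted order of the (Name, Rank) keys.
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()

# sort=False: ids follow the order in which each key first appears in df.
df["GroupIdUnsorted"] = df.groupby(["Name", "Rank"], sort=False).ngroup()

print(df)
```

With this data GroupId matches the Group_id column requested in the question.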
Previous answers do not mention how the id within a group is assigned, or whether the assignment is reproducible across multiple calls or across systems. Hence the ranking of items is not controlled by the user.
To address this issue, I use the following function to assign a rank to individual elements within each group. The sorter argument enables me to control precisely how the rank is assigned.
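def group_rank_id(df, grouper, sorter):
    # function to apply to each group: order the rows by the sorter columns,
    # then number them 0, 1, 2, ... in a new 'rank' column
    def group_fun(x):
        return (x[sorter].sort_values(sorter)
                .reset_index(drop=True).reset_index()
                .rename(columns={'index': 'rank'}))
    # apply per group, then merge the ranks back onto the original frame
    out = df.groupby(grouper).apply(group_fun).reset_index(drop=True)
    return df.merge(out, on=sorter)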
Example data:
df
  action  quantity   ticker          date     price
0    buy       3.0  SXRV.DE  1.584662e+09  0.519707
1    buy       7.0   MSF.DE  1.599696e+09  0.998484
2    buy       1.0  ABEA.DE  1.600387e+09  0.538107
3    buy       1.0    AMZ.F  1.606349e+09  0.446594
4    buy       9.0  09KE.BE  1.610669e+09  0.383777
5    buy      11.0  09KF.BE  1.610669e+09  0.987921
6    buy       3.0  FB2A.MU  1.620173e+09  0.696381
7    buy       3.0  FB2A.MU  1.636070e+09  0.700757
will result in:
group_rank_id(df, 'ticker', ['ticker', 'date'])
  action  quantity   ticker          date     price  rank
0    buy       3.0  SXRV.DE  1.584662e+09  0.519707     0
1    buy       7.0   MSF.DE  1.599696e+09  0.998484     0
2    buy       1.0  ABEA.DE  1.600387e+09  0.538107     0
3    buy       1.0    AMZ.F  1.606349e+09  0.446594     0
4    buy       9.0  09KE.BE  1.610669e+09  0.383777     0
5    buy      11.0  09KF.BE  1.610669e+09  0.987921     0
6    buy       3.0  FB2A.MU  1.620173e+09  0.696381     0
7    buy       3.0  FB2A.MU  1.636070e+09  0.700757     1
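For this kind of rank within a group, pandas also offers the built-in GroupBy.cumcount. The following is a minimal sketch of that alternative, not the function above: it sorts by date explicitly and then numbers the rows within each ticker. It uses only the ticker and date columns from the example data; the frame names trades and shuffled are illustrative, and the reversed copy is there only to show that the rank follows the date rather than the row order.

```python
import pandas as pd

# ticker and date values from the example data above.
trades = pd.DataFrame({
    "ticker": ["SXRV.DE", "MSF.DE", "ABEA.DE", "AMZ.F",
               "09KE.BE", "09KF.BE", "FB2A.MU", "FB2A.MU"],
    "date": [1.584662e9, 1.599696e9, 1.600387e9, 1.606349e9,
             1.610669e9, 1.610669e9, 1.620173e9, 1.636070e9],
})

# Sort by date, number the rows within each ticker (0, 1, 2, ...).
# Assigning the result aligns it back to the original rows by index.
trades["rank"] = trades.sort_values("date").groupby("ticker").cumcount()

# Same data in reverse row order: the later FB2A.MU trade now comes first,
# but it still gets rank 1 because the ranking is driven by date.
shuffled = trades[["ticker", "date"]].iloc[::-1].reset_index(drop=True)
shuffled["rank"] = shuffled.sort_values("date").groupby("ticker").cumcount()

print(trades)
print(shuffled)
```

Because the sort is explicit, the rank does not depend on how the rows happen to be ordered in the input.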