Python Pandas: Is Order Preserved When Using groupby() and agg()?
Question:
I’ve frequently used pandas’ agg()
function to run summary statistics on every column of a DataFrame. For example, here’s how you would produce the mean and standard deviation:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})
>>> df
[output]
A B C
0 group1 10 100
1 group1 12 102
2 group2 10 100
3 group2 25 250
4 group3 10 100
5 group3 12 102
In both of those cases, the order in which individual rows are sent to the agg function does not matter. But consider the following example:
df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
[output]
B C
mean <lambda> mean <lambda>
A
group1 11.0 12 101 102
group2 17.5 25 175 250
group3 11.0 12 101 102
In this case the lambda functions as intended, outputting the second row in each group. However, I haven’t been able to find anything in the pandas documentation implying that this is guaranteed to be true in all cases. I want to use agg()
along with a weighted average function, so I want to be sure that the rows arriving at the function are in the same order as they appear in the original data frame.
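For concreteness, the kind of order-dependent aggregation I have in mind looks like this (the 0.25/0.75 positional weights here are made up purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Hypothetical positional weights: weight each row by its position within the group.
weights = np.array([0.25, 0.75])

def weighted_mean(x):
    # Only correct if x arrives in the same order as in the original frame.
    return np.average(x, weights=weights[:len(x)])

result = df.groupby('A')['B'].agg(weighted_mean)
```

If groupby shuffled the rows within a group, the weights would attach to the wrong observations and the result would silently change.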
Does anyone know, ideally via somewhere in the docs or pandas source code, if this is guaranteed to be the case?
Answers:
See this enhancement issue
The short answer is yes, the groupby will preserve the orderings as passed in. You can prove this by using your example like this:
In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
Out[20]:
B C
mean <lambda> mean <lambda>
A
group1 11.0 10 101 100
group2 17.5 10 175 100
group3 11.0 10 101 100
This is NOT true for resample, however, as it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).
There is a sort=
flag to groupby, but this relates to the sorting of the group keys themselves and not the observations within a group.
FYI: df.groupby('A').nth(1)
is a safe way to get the 2nd value of a group (your method above will fail if a group has fewer than 2 elements).
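To illustrate the difference, here is a small sketch (a made-up two-group frame) where one group has only a single row:

```python
import pandas as pd

df = pd.DataFrame({'A': ['g1', 'g1', 'g2'], 'B': [10, 12, 99]})

# nth(1) simply yields no row for groups with fewer than 2 elements.
safe = df.groupby('A').nth(1)

# x.iloc[1] raises IndexError for the one-row group 'g2'.
raised = False
try:
    df.groupby('A')['B'].agg(lambda x: x.iloc[1])
except IndexError:
    raised = True
```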
Even easier:
import pandas as pd
pd.pivot_table(df, index='A', aggfunc=np.mean)
output:
B C
A
group1 11.0 101
group2 17.5 175
group3 11.0 101
The pandas 0.19.1 docs say “groupby preserves the order of rows within each group”, so this is guaranteed behavior.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
To preserve the order of the groups themselves (the order in which the keys first appear), you’ll need to pass .groupby(..., sort=False)
. In your case the grouping column is already sorted, so it makes no difference, but in general one must use the sort=False
flag:
df.groupby('A', sort=False).agg([np.mean, lambda x: x.iloc[1] ])
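To see what sort=False actually changes, here is a small sketch with group keys that are not already sorted (a made-up frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'b', 'a', 'a'], 'B': [1, 2, 3, 4]})

# With the default sort=True the group keys come out sorted: ['a', 'b'].
default_keys = list(df.groupby('A')['B'].sum().index)

# With sort=False they keep the order of first appearance: ['b', 'a'].
unsorted_keys = list(df.groupby('A', sort=False)['B'].sum().index)
```

In neither case does the flag affect the order of rows inside each group.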
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
The API accepts sort as an argument.
The description of the sort argument reads:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Thus, it is clear that groupby does preserve the order of rows within each group.
Unfortunately the answer to this question is NO. In the past few days I’ve created an algorithm for non-uniform chunking and found that it cannot possibly retain order, because a groupby produces subframes keyed by the groupby input. So you end up with:
allSubFrames = df.groupby("myColumnToOrderBy")
for orderKey, individualSubFrame in allSubFrames:
    # do something...
Because it’s using dictionaries, you lose the ordering.
If you perform a sort afterwards, as mentioned above (which I’ve just tested on a massive dataset), you end up with an O(n log n) computation.
However, I found that if you have, for instance, ordered time series data where you want to preserve the order, it is better to turn the ordering column into a list and then keep a counter that records the first item of each time series. This results in an O(n) calculation.
So, essentially, if you are using a relatively small dataset the answers proposed above are reasonable, but with a big dataset you should consider avoiding groupby-and-sort. Instead use list(df['myColumnToOrderBy'])
and iterate over it.
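The single O(n) pass described above might look something like this (a sketch, assuming equal keys are contiguous, as in ordered time series data; column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'val': [1, 2, 3, 4, 5, 6]})

# One O(n) pass over the key column: note every index where the key changes.
keys = list(df['key'])
starts = [0] + [i for i in range(1, len(keys)) if keys[i] != keys[i - 1]]
ends = starts[1:] + [len(keys)]

# Slice out each contiguous chunk, preserving the original row order.
chunks = [df.iloc[s:e] for s, e in zip(starts, ends)]
```

Note this only works when rows with equal keys are already adjacent; for scattered keys you are back to groupby or an explicit sort.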