How to apply "first" and "last" functions to columns while using group by in pandas?
Question:
I have a data frame and I would like to group it by a particular column (in other words, by the values in that column). I can do it in the following way:
grouped = df.groupby(['ColumnName'])
I imagine the result of this operation as a table in which some cells can contain sets of values instead of single values. To get a usual table (i.e. a table in which every cell contains a single value) I need to indicate which function should reduce each set of values to a single value.
For example, I can replace the sets of values by their sum, or by their minimum or maximum, using grouped.sum() or grouped.min(), and so on.
Now I want to use different functions for different columns. I figured out that I can do it in the following way:
grouped.agg({'ColumnName1':sum, 'ColumnName2':min})
However, for some reason I cannot use first. In more detail, grouped.first() works, but grouped.agg({'ColumnName1':first, 'ColumnName2':first}) does not. Instead I get an error:
NameError: name 'first' is not defined
So my question is: why does this happen, and how can I resolve it?
ADDED
Here I found the following example:
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})
Maybe I also need to use np? But in my case Python does not recognize “np”. Should I import it?
Answers:
I’m not sure if this is really the issue, but sum and min are Python built-ins that take an iterable as input, whereas first is a method of the pandas Series object, so maybe it’s not in your namespace. Moreover, it takes something else as input (the docs say an offset value).
I guess one way to get around it is to create your own first function, and define it so that it takes a Series object as input, e.g.:
def first(series, offset):
    return series.first(offset)
or something like that.
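A minimal sketch of that idea, under some assumptions: agg calls the function with a single argument (the Series for one group), so the helpers here take no offset and simply pick a position with iloc. The names first_value and last_value and the sample frame are made up for illustration.

```python
import pandas as pd


def first_value(series):
    """Return the first element of one group's column (hypothetical helper)."""
    return series.iloc[0]


def last_value(series):
    """Return the last element of one group's column (hypothetical helper)."""
    return series.iloc[-1]


# Made-up sample data: two groups, "a" and "b".
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})

# A different user-defined function per column.
result = df.groupby("key").agg({"x": first_value, "y": last_value})
print(result)
```

Because the functions are ordinary names defined in your own module, there is no NameError, which is the same reason the built-ins sum and min work.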
I think the issue is that there are two different first methods which share a name but act differently: one is for groupby objects and the other is for a Series/DataFrame (to do with time series).
To replicate the behaviour of the groupby first method over a DataFrame using agg, you could use iloc[0] (which gets the first row in each group (DataFrame/Series) by position):
grouped.agg(lambda x: x.iloc[0])
For example:
In [1]: df = pd.DataFrame([[1, 2], [3, 4]])
In [2]: g = df.groupby(0)
In [3]: g.first()
Out[3]:
   1
0
1  2
3  4
In [4]: g.agg(lambda x: x.iloc[0])
Out[4]:
   1
0
1  2
3  4
Analogously, you can replicate last using iloc[-1].
Note: this also works column-wise:
g.agg({1: lambda x: x.iloc[0]})
In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.
A couple of updated notes:
This is better done using the nth groupby method, which is much faster in pandas >= 0.13:
g.nth(0) # first
g.nth(-1) # last
You have to take a little care, as by default first and last skip NaN values (they return the first/last non-null entry in each column), and IIRC this was broken for DataFrame groupbys before 0.13. There’s a dropna option for nth.
You can use strings rather than built-ins (though IIRC pandas spots that it’s the sum builtin and applies np.sum):
grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
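To make the NaN caveat from the first note concrete, here is a minimal sketch. The frame and the column names key and val are made up for illustration. It compares the groupby first/last methods with the iloc-based lambdas shown earlier; nth is left out because, as far as I know, what it returns has changed between pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up data: group "a" starts with NaN, group "b" ends with NaN.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "val": [np.nan, 1.0, 2.0, np.nan],
})
g = df.groupby("key")["val"]

first_non_null = g.first()                 # skips NaN: a -> 1.0, b -> 2.0
first_row = g.agg(lambda s: s.iloc[0])     # literal first row: a -> NaN, b -> 2.0

last_non_null = g.last()                   # skips NaN: a -> 1.0, b -> 2.0
last_row = g.agg(lambda s: s.iloc[-1])     # literal last row: a -> 1.0, b -> NaN

print(pd.DataFrame({
    "first()": first_non_null,
    "iloc[0]": first_row,
    "last()": last_non_null,
    "iloc[-1]": last_row,
}))
```

So the two approaches only agree when the first (or last) row of each group is non-null; pick whichever matches what "first" should mean for your data.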
Instead of using first or last, use their string representations in the agg method. For example, in the OP’s case:
grouped = df.groupby(['ColumnName'])
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})
# you can use the string names for first and last
grouped['D'].agg({'result1' : 'first', 'result2' : 'last'})
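For newer pandas versions, here is a hedged sketch of the same idea. As far as I know, passing a renaming dict to agg on a single selected column is no longer accepted in pandas 1.0 and later, so this uses a per-column dict on the whole groupby and keyword-style named aggregation (available since 0.25) for the renamed results. The sample data is made up; the column names follow the question.

```python
import pandas as pd

# Made-up data using the column names from the question.
df = pd.DataFrame({
    "ColumnName": ["a", "a", "b", "b"],
    "ColumnName1": [1, 2, 3, 4],
    "ColumnName2": [10, 20, 30, 40],
})
grouped = df.groupby("ColumnName")

# Different string aggregations for different columns.
per_column = grouped.agg({"ColumnName1": "first", "ColumnName2": "last"})

# Several renamed results from one column, via named aggregation.
renamed = grouped["ColumnName1"].agg(result1="first", result2="last")

print(per_column)
print(renamed)
```

The strings are looked up by pandas itself, so nothing called first or last has to exist in your namespace.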
I would use a custom aggregator as shown below.
d = pd.DataFrame([[1, "man"], [1, "woman"], [1, "girl"], [2, "man"], [2, "woman"]],
                 columns='number family'.split())
d
Here is the output:
   number family
0       1    man
1       1  woman
2       1   girl
3       2    man
4       2  woman
Now the aggregation, taking the first and last elements of each group:
d.groupby(by="number").agg(firstFamily=('family', lambda x: list(x)[0]),
                           lastFamily=('family', lambda x: list(x)[-1]))
The output of this aggregation is shown below.
       firstFamily lastFamily
number
1              man       girl
2              man      woman
I hope this helps.
c_df = b_df.groupby('time').agg(first_x=('x', lambda x: list(x)[0]),
                                last_x=('x', lambda x: list(x)[-1]),
                                last_y=('y', lambda x: list(x)[-1]))