Set column name for apply result over groupby
Question:
This is a fairly trivial problem, but it’s triggering my OCD and I haven’t been able to find a suitable solution for the past half hour.
For background, I’m looking to calculate a value (let’s call it F) for each group in a DataFrame derived from different aggregated measures of columns in the existing DataFrame.
Here’s a toy example of what I’m trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['X', 'Y', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
                   'B': ['N', 'N', 'N', 'M', 'N', 'M', 'M', 'N', 'M', 'N'],
                   'C': [69, 83, 28, 25, 11, 31, 14, 37, 14, 0],
                   'D': [0.3, 0.1, 0.1, 0.8, 0.8, 0.0, 0.8, 0.8, 0.1, 0.8],
                   'E': [11, 11, 12, 11, 11, 12, 12, 11, 12, 12]
                   })
df_grp = df.groupby(['A','B'])
df_grp.apply(lambda x: x['C'].sum() * x['D'].mean() / x['E'].max())
What I’d like to do is assign a name to the result of apply (or lambda). Is there any way to do this without moving lambda to a named function or renaming the column after running the last line?
Answers:
You could convert your Series to a DataFrame using reset_index() and provide name='your_col_name', which sets the name of the column corresponding to the Series values:
(df_grp.apply(lambda x: x['C'].sum() * x['D'].mean() / x['E'].max())
.reset_index(name='your_col_name'))
   A  B  your_col_name
0  X  N       5.583333
1  Y  M       2.975000
2  Y  N       3.845455
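To make it clearer why name= is accepted here, the following is a minimal sketch that checks the type of the apply result before flattening it. The column name 'F' and the up-front selection of C, D and E are assumptions of this sketch, not part of the original answer; the selection is only there so that apply never touches the grouping columns, which some recent pandas releases warn about.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['X', 'Y', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
    'B': ['N', 'N', 'N', 'M', 'N', 'M', 'M', 'N', 'M', 'N'],
    'C': [69, 83, 28, 25, 11, 31, 14, 37, 14, 0],
    'D': [0.3, 0.1, 0.1, 0.8, 0.8, 0.0, 0.8, 0.8, 0.1, 0.8],
    'E': [11, 11, 12, 11, 11, 12, 12, 11, 12, 12],
})

# Assumption: restrict the groups to the value columns the lambda needs.
per_group = df.groupby(['A', 'B'])[['C', 'D', 'E']].apply(
    lambda x: x['C'].sum() * x['D'].mean() / x['E'].max()
)

# The lambda returns a scalar per group, so the result is a Series
# indexed by (A, B); Series.reset_index takes a name= argument.
print(type(per_group).__name__)

# 'F' is a hypothetical column name chosen for this example.
out = per_group.reset_index(name='F')
print(out.columns.tolist())
```

If the lambda returned a Series or DataFrame instead, apply would give back a DataFrame, and name= would no longer be available on reset_index.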
Have the lambda function return a new Series:
df_grp.apply(lambda x: pd.Series({'new_name':
x['C'].sum() * x['D'].mean() / x['E'].max()}))
# or df_grp.apply(lambda x: x['C'].sum() * x['D'].mean() / x['E'].max()).to_frame('new_name')
     new_name
A B
X N  5.583333
Y M  2.975000
  N  3.845455
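Returning a Series from the lambda also scales to more than one named value per group. The sketch below assumes a second, purely illustrative measure called 'C_total' alongside 'F'; neither name comes from the question, and the column selection before apply is again an assumption made to keep the grouping columns out of the lambda.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['X', 'Y', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
    'B': ['N', 'N', 'N', 'M', 'N', 'M', 'M', 'N', 'M', 'N'],
    'C': [69, 83, 28, 25, 11, 31, 14, 37, 14, 0],
    'D': [0.3, 0.1, 0.1, 0.8, 0.8, 0.0, 0.8, 0.8, 0.1, 0.8],
    'E': [11, 11, 12, 11, 11, 12, 12, 11, 12, 12],
})

# Each key of the returned Series becomes a column of the result.
# 'F' and 'C_total' are hypothetical names used for illustration.
multi = df.groupby(['A', 'B'])[['C', 'D', 'E']].apply(
    lambda x: pd.Series({
        'F': x['C'].sum() * x['D'].mean() / x['E'].max(),
        'C_total': x['C'].sum(),
    })
)

print(multi.columns.tolist())
print(multi.loc[('X', 'N'), 'C_total'])
```

The result keeps (A, B) as a MultiIndex; calling reset_index() on it would turn the keys back into ordinary columns.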
The accepted answer works with the current version of pandas. Note that name is a documented parameter of Series.reset_index, but not of DataFrame.reset_index; the latter has a names argument, which serves a different purpose (it renames the columns created from the index).
Since the output of apply here is a Series, we can also simply use pandas.Series.rename() to achieve the result.
import pandas as pd
df = pd.DataFrame({'A': ['X', 'Y', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
                   'B': ['N', 'N', 'N', 'M', 'N', 'M', 'M', 'N', 'M', 'N'],
                   'C': [69, 83, 28, 25, 11, 31, 14, 37, 14, 0],
                   'D': [0.3, 0.1, 0.1, 0.8, 0.8, 0.0, 0.8, 0.8, 0.1, 0.8],
                   'E': [11, 11, 12, 11, 11, 12, 12, 11, 12, 12]
                   })
df_grp = df.groupby(['A','B'])
df_grp.apply(lambda x: x['C'].sum() * x['D'].mean() / x['E'].max()).rename("your_col_name")
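As a rough check of what rename does here, the sketch below confirms that the name lands on the Series itself and is then picked up as the column label by a plain reset_index(). Selecting C, D and E before apply is an assumption of this sketch, added so the lambda only sees the value columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['X', 'Y', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
    'B': ['N', 'N', 'N', 'M', 'N', 'M', 'M', 'N', 'M', 'N'],
    'C': [69, 83, 28, 25, 11, 31, 14, 37, 14, 0],
    'D': [0.3, 0.1, 0.1, 0.8, 0.8, 0.0, 0.8, 0.8, 0.1, 0.8],
    'E': [11, 11, 12, 11, 11, 12, 12, 11, 12, 12],
})

# Passing a scalar to Series.rename sets the Series name,
# leaving the (A, B) index labels untouched.
named = df.groupby(['A', 'B'])[['C', 'D', 'E']].apply(
    lambda x: x['C'].sum() * x['D'].mean() / x['E'].max()
).rename('your_col_name')

print(named.name)

# With the name already set, reset_index() needs no name= argument.
flat = named.reset_index()
print(flat.columns.tolist())
```

This keeps the result a Series, which is convenient if the grouped index is still needed; to_frame() or reset_index() can follow when a DataFrame is wanted.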