python pandas column with averages
Question:
I have a DataFrame with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame. Now I’d like to add a third column in which I store the average of the values in column "B" that share the same location in column "A".
- I know that .mean() can be used to get an average
- I know how to filter with .loc[]
I could make a list of all unique values in column A and compute the average for each of them in a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?
Answers:
I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.
This can be done using pandas.DataFrame.groupby.
Consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)
which gives the output
B
A
X 10.333333
Y 5.000000
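The question asks for the averages as a third column rather than as a separate table. One way to get there from the grouped means, as a minimal sketch, is to join them back onto the original rows. The column name B_mean is only a name chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Y", "X", "X"], "B": [1, 3, 7, 10, 20]})

# Per-location means as a Series indexed by the values of column A
means = df.groupby("A")["B"].mean().rename("B_mean")

# Look up each row's value of A in the index of `means`
# and attach the matching mean as a new column
result = df.join(means, on="A")
print(result)
```

The original row order and index are kept, so every row of location X carries the same mean.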
Sounds like what you need is GroupBy. Take a look here
Given
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
You can use
df.groupby('A').mean()
to group the values based on the common values in column "A" and find the mean.
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
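Note that group 1 has a missing value in column B, yet its mean is 3.0. The sketch below, using the same data, shows that the NaN is skipped rather than counted, and uses count() to show how many values actually went into each mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2],
                   "B": [np.nan, 2, 3, 4, 5],
                   "C": [1, 2, 1, 1, 2]})

means = df.groupby("A").mean()

# Group 1 has the B values NaN, 2 and 4; the NaN is ignored,
# so the mean is (2 + 4) / 2
print(means.loc[1, "B"])

# count() reports the number of non-missing values per group and column
counts = df.groupby("A").count()
print(counts)
```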
import pandas as pd
data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)
df2 = df.groupby(['A']).mean()
print(df2)
Based on your description, I’m not sure whether you simply want to calculate the average for each group or whether you want to keep the long format of your data. I’ll break down a solution for each option.
The data I’ll use below can be generated by running the following…
import pandas as pd
df = pd.DataFrame([['group1', 2],
                   ['group2', 4],
                   ['group1', 5],
                   ['group2', 2],
                   ['group1', 2],
                   ['group2', 0]], columns=['A', 'B'])
Option 1 – Calculate Group Averages
This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data calculations.
df.groupby('A').B.mean()
Output:
A
group1 3.0
group2 2.0
If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
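As a small sketch of the difference: .to_frame() keeps the group names in the index, while .reset_index() (an alternative not mentioned above) turns them back into an ordinary column. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([["group1", 2], ["group2", 4], ["group1", 5],
                   ["group2", 2], ["group1", 2], ["group2", 0]],
                  columns=["A", "B"])

# Series indexed by group name
series_result = df.groupby("A").B.mean()

# One-column DataFrame, group names still in the index
frame_result = series_result.to_frame()

# Two-column DataFrame, group names moved into a regular column A
flat_result = series_result.reset_index()

print(frame_result)
print(flat_result)
```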
Option 2 – Calculate Group Averages and Maintain Long Format
By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we’ll call it C) containing a mean that is connected to the A column, i.e.:
A | B | C (average) |
---|---|---|
group1 | 2 | 3 |
group2 | 4 | 2 |
group1 | 5 | 3 |
group2 | 2 | 2 |
group1 | 2 | 3 |
group2 | 0 | 2 |
Where the averages for each group are…
group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2
The most efficient solution would be to use .transform, which behaves like an SQL window function, but I think this method can be a little confusing when you’re new to pandas.
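# Passing the string 'mean' avoids the FutureWarning that recent pandas versions raise for np.mean
# assign returns a new dataframe; reassign it to df if you want to keep column C
df.assign(C=df.groupby('A').B.transform('mean'))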
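If the window function comparison is unclear, the sketch below may help. It puts .mean() and .transform side by side: the first returns one value per group, the second returns one value per original row, which is why it can be assigned straight to a new column. The names per_group and per_row are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([["group1", 2], ["group2", 4], ["group1", 5],
                   ["group2", 2], ["group1", 2], ["group2", 0]],
                  columns=["A", "B"])

# Aggregation: one row per group (2 rows here)
per_group = df.groupby("A")["B"].mean()

# Transformation: one value per original row (6 rows here),
# aligned with the index of df
per_row = df.groupby("A")["B"].transform("mean")

result = df.assign(C=per_row)
print(result)
```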
A less efficient, but more beginner friendly option would be to store the averages in a dictionary and then map each row to the group average.
I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.
To accomplish this, you can…
- Create a dictionary containing the grouped averages
- For every row in the dataframe, pass the group name into the dictionary
# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)
You can also, optionally, do all of this in a single line
df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))
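To illustrate the historical average use case mentioned above, here is a rough sketch. The dictionary historical_averages and its value are made up for this example. Groups missing from the dictionary come back as NaN from .map, so the sketch falls back to the sample mean for those rows.

```python
import pandas as pd

df = pd.DataFrame([["group1", 2], ["group2", 4], ["group1", 5],
                   ["group2", 2], ["group1", 2], ["group2", 0]],
                  columns=["A", "B"])

# Hypothetical historical averages (invented for this sketch);
# group2 is deliberately left out
historical_averages = {"group1": 2.5}

# Rows whose group is not in the dictionary become NaN
mapped = df.A.map(historical_averages)

# Fall back to the average of the sampled data where no historical value exists
sample_means = df.groupby("A").B.transform("mean")
df = df.assign(C=mapped.fillna(sample_means))
print(df)
```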