python pandas column with averages

Question:

I have a DataFrame with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame. Now I’d like to add a third column that stores, for each row, the average of the column "B" values that share that row’s location in column "A".

-I know that .mean() can be used to get an average

-I know how to filter with .loc[]

I could make a list of all unique values in column A and compute the average for each of them in a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?

Asked By: Cornelis

Answers:

I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.

This can be done using pandas.DataFrame.groupby. Consider the following simple example:

import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)

gives output

           B
A
X  10.333333
Y   5.000000
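
To put the per-location average on every row, as the question asks, one possible follow-up is to join these means back onto the original frame. This is a minimal sketch on the same example data; the column name B_mean is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Y", "X", "X"], "B": [1, 3, 7, 10, 20]})

# Per-group means of B, indexed by the values of column A
means = df.groupby("A")["B"].mean()

# Look up each row's group mean through its value in column A
# ("B_mean" is an illustrative name, not part of the original answer)
result = df.join(means.rename("B_mean"), on="A")
print(result)
```

The join keeps the original row order and index, so every "X" row carries the "X" mean and every "Y" row the "Y" mean.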
Answered By: Daweo

Sounds like what you need is GroupBy.

Given

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

You can use

df.groupby('A').mean()

to group the values based on the common values in column "A" and find the mean.

Output:

     B         C
A
1  3.0  1.333333
2  4.0  1.500000
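
Note that the mean of B for group 1 is 3.0 even though one of its values is NaN, because mean skips missing values by default. A small sketch to check this on the same data (the variable name means is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

means = df.groupby('A').mean()

# Group 1 has B values NaN, 2 and 4; the NaN is skipped, so the mean is (2 + 4) / 2
print(means.loc[1, 'B'])

# count() shows how many non-missing values went into each mean
counts = df.groupby('A').count()
print(counts)
```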
Answered By: Ajay Sachdev

import pandas as pd

data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)

df2 = df.groupby(['A']).mean()
print(df2)
Answered By: Jessica King

Based on your description, I’m not sure whether you simply want to calculate the averages for each group, or whether you want to maintain the long format of your data. I’ll break down a solution for each option.

The data I’ll use below can be generated by running the following…

import pandas as pd
df = pd.DataFrame([['group1 ', 2],
                   ['group2 ', 4],
                   ['group1 ', 5],
                   ['group2 ', 2],
                   ['group1 ', 2],
                   ['group2 ', 0]], columns=['A', 'B'])

Option 1 – Calculate Group Averages

This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data calculations.

df.groupby('A').B.mean()

Output:

A
group1     3.0
group2     2.0

If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
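
As a rough illustration of the difference, the sketch below builds the same data and shows .to_frame() next to .reset_index(), which is another common way to get a DataFrame with A as a regular column. The variable names as_frame and flat are made up here.

```python
import pandas as pd

df = pd.DataFrame([['group1 ', 2], ['group2 ', 4], ['group1 ', 5],
                   ['group2 ', 2], ['group1 ', 2], ['group2 ', 0]],
                  columns=['A', 'B'])

# Series -> one-column DataFrame, with A kept as the index
as_frame = df.groupby('A').B.mean().to_frame()
print(as_frame)

# Alternative: turn the A index back into a regular column
flat = df.groupby('A').B.mean().reset_index()
print(flat)
```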

Option 2 – Calculate Group Averages and Maintain Long Format

By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we’ll call it C) containing the mean for that row’s group in the A column, i.e.:

A       B  C (average)
group1  2  3
group2  4  2
group1  5  3
group2  2  2
group1  2  3
group2  0  2

Where the averages for each group are…

group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2

The most efficient solution would be to use .transform, which behaves like a SQL window function, but I think this method can be a little confusing when you’re new to pandas.

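# The string 'mean' selects pandas' built-in group mean;
# recent pandas versions warn when np.mean is passed instead.
# assign returns a new DataFrame with the extra column C.
df.assign(C=df.groupby('A').B.transform('mean'))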
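
To see why .transform fits here: it returns one value per original row, in the original row order, rather than one value per group. A small sketch on the data above, using the string 'mean' as the aggregation name (per_group and per_row are illustrative names):

```python
import pandas as pd

df = pd.DataFrame([['group1 ', 2], ['group2 ', 4], ['group1 ', 5],
                   ['group2 ', 2], ['group1 ', 2], ['group2 ', 0]],
                  columns=['A', 'B'])

per_group = df.groupby('A').B.mean()            # one value per group
per_row = df.groupby('A').B.transform('mean')   # one value per original row

print(len(per_group), len(per_row))

# per_row shares df's index, so it can be assigned directly as a new column
df['C'] = per_row
print(df)
```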

A less efficient, but more beginner friendly option would be to store the averages in a dictionary and then map each row to the group average.

I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.

To accomplish this, you can…

  1. Create a dictionary containing the grouped averages
  2. For every row in the dataframe, pass the group name into the dictionary

# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)

You can also, optionally, do all of this in a single line

df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))
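
The dictionary approach also covers the case mentioned above, where the averages come from a different dataset than the one being filled in. A minimal sketch, assuming a made-up historical frame and sample frame (neither is part of the original data); groups that are missing from the dictionary map to NaN.

```python
import pandas as pd

# Hypothetical historical data, used only to compute the averages
historical = pd.DataFrame({'A': ['group1', 'group1', 'group2'],
                           'B': [10, 20, 4]})

# Hypothetical current sample; 'group3' has no historical average
sample = pd.DataFrame({'A': ['group1', 'group2', 'group3'],
                       'B': [1, 2, 3]})

# Averages come from the historical data, not from the sample itself
historical_averages = historical.groupby('A').B.mean().to_dict()

# Map each row's group name to its historical average
sample = sample.assign(C=sample.A.map(historical_averages))
print(sample)
```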
Answered By: Joél Collins