Calculate conditional probabilities in pandas

Question:

I’m trying to calculate a conditional response probabilities when aggregating my dataset. Take the following toy example:

import pandas as pd

gender = [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
is_family = [0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]
treatment = [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1]
response = [1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1]
num_rows = [10,10,5,20,0,5,10,30,20,30,10,5,60,10,10,20]

df = pd.DataFrame(data={'gender': gender, 'is_family': is_family, 'treatment': treatment, 'response': response, 'num_rows': num_rows})
    gender  is_family  treatment  response  num_rows
0        0          0          0         1        10
1        0          0          1         0        10
2        0          0          0         0         5
3        0          0          1         1        20
4        0          1          0         1         0
5        0          1          1         0         5
6        0          1          0         0        10
7        0          1          1         1        30
8        1          0          0         1        20
9        1          0          1         0        30
10       1          0          0         0        10
11       1          0          1         1         5
12       1          1          0         1        60
13       1          1          1         0        10
14       1          1          0         0        10
15       1          1          1         1        20

When grouping and aggregating by gender, treatment, and response I want to (1) sum the number of rows for each group and (2) calculate the probability of response given treatment. The result should look like this

   gender  treatment  response  num_rows  resp_prob
0       0          0         0        15   0.600000
1       0          0         1        10   0.400000
2       0          1         0        15   0.230769
3       0          1         1        50   0.769231
4       1          0         0        20   0.200000
5       1          0         1        80   0.800000
6       1          1         0        40   0.615385
7       1          1         1        25   0.384615

The first response probability is calculated as follows: 15 (response=0, treatment=0) / 25 (treatment=0) = 0.6. The third response probability is calculated as follows: 15 / 65 = 0.23. Etc.

I can sum up the number of samples for each group with:

df.groupby(by=['gender', 'treatment', 'response'])['num_rows'].sum().reset_index()

but what about the probabilities?

Any ideas?

Asked By: Gusto

||

Answers:

IIUC, use a double groupby:

(df.groupby(by=['gender', 'treatment', 'response'],
            as_index=False)
   ['num_rows'].sum()
   .assign(resp_prob=lambda d: d['num_rows'].div(
                                d.groupby(['gender', 'treatment'])
                                ['num_rows'].transform('sum'))
          )
)

output:

   gender  treatment  response  num_rows  resp_prob
0       0          0         0        15   0.600000
1       0          0         1        10   0.400000
2       0          1         0        15   0.230769
3       0          1         1        50   0.769231
4       1          0         0        20   0.200000
5       1          0         1        80   0.800000
6       1          1         0        40   0.615385
7       1          1         1        25   0.384615
Answered By: mozway

You can do this:

df["resp_prob"] = df["num_rows"].div(
    df.groupby(["gender", "treatment"])["num_rows"].transform("sum")
)

So all you need is total per gender, treatment and then you already know individual totals in the form of num_rows, so you get probability for each as num_rows/total

output:

   gender  treatment  response  num_rows  resp_prob
0       0          0         0        15   0.600000
1       0          0         1        10   0.400000
2       0          1         0        15   0.230769
3       0          1         1        50   0.769231
4       1          0         0        20   0.200000
5       1          0         1        80   0.800000
6       1          1         0        40   0.615385
7       1          1         1        25   0.384615
Answered By: SomeDude
def function1(dd:pd.DataFrame):
    return dd.to_frame('num_rows').assign(resp_prob=dd/dd.sum())

df.groupby(by=['gender', 'treatment', 'response'])['num_rows'].sum()
    .groupby(level=[0,1]).apply(function1).reset_index()

   gender  treatment  response  num_rows  resp_prob
0       0          0         0        15   0.600000
1       0          0         1        10   0.400000
2       0          1         0        15   0.230769
3       0          1         1        50   0.769231
4       1          0         0        20   0.200000
5       1          0         1        80   0.800000
6       1          1         0        40   0.615385
7       1          1         1        25   0.384615
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.