Pandas group-by proportion of cumulative sum start from 0

Question

I have the following pandas Data Frame (without 2 the last columns):

   name        day   show-in-appointment    previous-missed-appointments   proportion-previous-missed
0  Jack   2020/01/01   show                              0                      0
1  Jack   2020/01/02   no-show                           0                      0
2  Jill   2020/01/02   no-show                           0                      0
3  Jack   2020/01/03   show                              1                      0.5
4  Jill   2020/01/03   show                              1                      1
5  Jill   2020/01/04   no-show                           1                      0.5
6  Jack   2020/01/04   show                              1                      0.33
7  Jill   2020/01/05   show                              2                      0.66
8  jack   2020/01/06   no-show                           1                      0.25
9  jack   2020/01/07   show                              2                 0.4>>>2(noshow)/5(noshow+show)

df = pd.DataFrame(
    data=np.asarray([
        ['Jack', 'Jack', 'Jill', 'Jack', 'Jill', 'Jill', 'Jack', 'Jill', 'jack', 'jack'],
        [
            '2020/01/01',
            '2020/01/02',
            '2020/01/02',
            '2020/01/03',
            '2020/01/03',
            '2020/01/04',
            '2020/01/04',
            '2020/01/05',
            '2020/01/06',
            '2020/01/07',
        ],
        ['show', 'no-show', 'no-show', 'show', 'show', 'no-show', 'show', 'show', 'no-show', 'show'],
    ]).T,
    columns=['name', 'day', 'show-in-appointment'],
)

previous-missed-appointments column is create like the code below:

df.name = df.name.str.capitalize()
df['order'] = df.index
df.day = pd.to_datetime(df.day)
df['noshow'] = df['show-in-appointment'].map({'show': 0, 'no-show': 1})
df = df.sort_values(by=['name', 'day'])
df['previous-missed-appointments'] = df.groupby('name').noshow.cumsum()
df.loc[df.noshow == 1, 'previous-missed-appointments'] -= 1
df = df.sort_values(by='order')
df = df.drop(columns=['noshow', 'order'])

********THE QUESTION IS

HOW CAN I CREATE THE LAST COLUMN ????*********

Asked By: pooya ensafi

||

Source

Answer 1

You can use cumsum and shift in groupby.apply for the first column, then divide by groupby.cumcount for the second column:

# ensure dates are sorted
df = df.sort_values(by='day', key=lambda s: pd.to_datetime(s, dayfirst=False))

m = df['show-in-appointment'].eq('no-show')

g = m.groupby(df['name'].str.casefold(), group_keys=False)
df['previous-missed-appointments'] =  (
  g.apply(lambda x: x.cumsum().shift(fill_value=0))
 )

df['proportion-previous-missed'] = (
    df['previous-missed-appointments'].div(g.cumcount()).fillna(0)
)

print(df)

NB. it is important that the input is first sorted by date (or name/date).

Output:

   name         day show-in-appointment  previous-missed-appointments  proportion-previous-missed
0  Jack  2020/01/01                show                             0                    0.000000
1  Jack  2020/01/02             no-show                             0                    0.000000
2  Jill  2020/01/02             no-show                             0                    0.000000
3  Jack  2020/01/03                show                             1                    0.500000
4  Jill  2020/01/03                show                             1                    1.000000
5  Jill  2020/01/04             no-show                             1                    0.500000
6  Jack  2020/01/04                show                             1                    0.333333
7  Jill  2020/01/05                show                             2                    0.666667
8  jack  2020/01/06             no-show                             1                    0.250000
9  jack  2020/01/07                show                             2                    0.400000

Intermediates (with names grouped together and shorter column names for clarity):

   name         day     show  previous-missed computation  proportion
0  Jack  2020/01/01     show                0         0/0        0.00
1  Jack  2020/01/02  no-show                0         0/1        0.00
3  Jack  2020/01/03     show                1         1/2        0.50
6  Jack  2020/01/04     show                1         1/3        0.33
8  Jack  2020/01/06  no-show                1         1/4        0.25
9  Jack  2020/01/07     show                2         2/5        0.40
2  Jill  2020/01/02  no-show                0         0/0        0.00
4  Jill  2020/01/03     show                1         1/1        1.00
5  Jill  2020/01/04  no-show                1         1/2        0.50
7  Jill  2020/01/05     show                2         2/3        0.67

Answered By: mozway

Pandas group-by proportion of cumulative sum start from 0

Question:

********THE QUESTION IS

Answers: