Pandas complex processing with groupby

Question:

My data is grouped by id. In each group, it is sorted by colB. The logic I need to implement is as follows:

If colA is blank, and colD is either (2,3, or 4),
then create a column called ‘flag’ and set flag = 1 in the last non-zero row of colC. Set the flag to 0 in all the other rows of that group, where colC is non-zero.
Remove the rows where (colA is blank, and colC is 0) for that particular grouping.

Repeat above procedure for all other ‘id’ groups.

(For rows where colA is non-blank, I can set the flag to what I need.)

Here is the data I have:

id  colA    colB    colC      colD
1           10      1352.23   2
1           11      706.87    2
1           12      1116.6    2
1           13      0         2
1           14      0         2
1           15      0         2
2           2       6884.03   3
2           3       2235.97   3
2           4       3618.04   3
2           5       11745.42  3
3   2013    1       345.98    0

and here is what I would like to get after processing it.

id  colA  colB  colC      colD  flag
1         10    1352.23   2     0
1         11    706.87    2     0
1         12    1116.6    2     1
2         2     6884.03   3     0
2         3     2235.97   3     0
2         4     3618.04   3     0
2         5     11745.42  3     1
3   2013  1     345.98    0     0

The data contains many thousands of such groupings. I am hoping someone can help me in figuring out what the Python code to do the above processing would look like. I have a basic familiarity with the groupby function, but not to the extent to be able to figure out how to do the above.


Here is the code I am trying to use. It gives this error:
“AttributeError: ‘str’ object has no attribute ‘id’.”

I am trying to set the “flag” to NaN when I detect the zeros in colC that I eventually want to remove, so I can drop them easily, in a later step.

def setFlag(grouped):
    for name, group in grouped:
        for i in range(group.id.size):
            drop_candidate = (
                     pd.isnull(group.iloc[i]['colA'])&
                  ( (group.iloc[i]['colD'] == 2) |
                    (group.iloc[i]['colD'] == 3) |
                    (group.iloc[i]['colD'] == 4)    ) 
                )

            last_nonZero = group[group != 0].index[-1]

            if (  (drop_candidate & (group.iloc[i]['colC'] == 0))  ):
                group['flag'] = np.nan
            elif ((drop_candidate & (group.iloc[i]['colC'] != 0)) & (last_nonZero != i)):
                group['flag'] = 0
            elif last_nonZero == i:
                group['flag'] = 1

        return grouped

df.groupby('id').apply(setFlag)

Here is the code to re-create the test dataframe:

import pandas as pd
import numpy as np
# A plain dict keeps the column order; DataFrame.from_items was removed in pandas 1.0
df = pd.DataFrame({
    'id': [1,1,1,1,1,1,2,2,2,2,3],
    'colA': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2013],
    'colB': [10,11,12,13,14,15,2,3,4,5,1],
    'colC': [1352.23,706.87,1116.6,0,0,0,6884.03,2235.97,3618.04,11745.42,345.98],
    'colD': [2,2,2,2,2,2,3,3,3,3,0],
    'flag': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    })
Asked By: Learner

||

Answers:

It looks like there are three parts to your process:

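1) Get rid of rows where colA is null and colC == 0. Work on reducing your dataframe first.

If you want to keep only the rows where colA is non-null AND colC is non-zero (stricter than what you describe; on the sample data it keeps only the id 3 row):

reduced_df = df.loc[(df.colA.notnull()) & (df.colC != 0), :].copy()

If you want to drop only the rows where colA is null and colC is 0, keep the rows where colA is non-null OR colC is non-zero. This is the version that produces the output below:

reduced_df = df.loc[(df.colA.notnull()) | (df.colC != 0), :].copy()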

    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2   NaN
1    1   NaN    11    706.87     2   NaN
2    1   NaN    12   1116.60     2   NaN
6    2   NaN     2   6884.03     3   NaN
7    2   NaN     3   2235.97     3   NaN
8    2   NaN     4   3618.04     3   NaN
9    2   NaN     5  11745.42     3   NaN
10   3  2013     1    345.98     0   NaN
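
The choice between the two masks can be checked on a tiny frame. This is only a sketch with made-up values (the four-row frame is hypothetical, not the question's data): negating "colA is null AND colC is 0" gives the same mask as "colA is non-null OR colC is non-zero", while the AND version keeps far fewer rows.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame covering every combination of the two conditions
demo = pd.DataFrame({
    "colA": [np.nan, np.nan, 2013, 2013],
    "colC": [0.0, 5.0, 0.0, 7.0],
})

# Rows the question wants removed: colA blank AND colC zero
drop = demo["colA"].isna() & demo["colC"].eq(0)

# Two candidate "keep" masks
keep_or = demo["colA"].notna() | demo["colC"].ne(0)
keep_and = demo["colA"].notna() & demo["colC"].ne(0)

print((~drop).tolist())    # negation of the drop mask
print(keep_or.tolist())    # identical to the negation
print(keep_and.tolist())   # stricter: also drops rows where only one condition holds
```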

2) Now you are ready to work on part two, which is flagging the last row of each group. Since the default flag value is 0, start with that:

reduced_df.loc[:, 'flag'] = 0

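3) You can find the last row for each colD value with duplicated(keep='last') (negated, so that only the last occurrence is selected) and then make sure colA is null. This assumes each id has its own colD value, as in the sample data: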

reduced_df.loc[~reduced_df.colD.duplicated(keep='last') & reduced_df.colA.isnull(), 'flag'] = 1

reduced_df

    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2     0
1    1   NaN    11    706.87     2     0
2    1   NaN    12   1116.60     2     1
6    2   NaN     2   6884.03     3     0
7    2   NaN     3   2235.97     3     0
8    2   NaN     4   3618.04     3     0
9    2   NaN     5  11745.42     3     1
10   3  2013     1    345.98     0     0
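
If colD values can repeat across different ids, the same idea can be keyed on id instead. The following is a minimal sketch, not a tested general solution: it assumes the blank-colA and colD conditions are uniform within each id (as in the sample data), and it reads the question as restricting both the row removal and the flag to colD values 2, 3 and 4. The names eligible, out and is_last are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    "colA": [np.nan] * 10 + [2013],
    "colB": [10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 1],
    "colC": [1352.23, 706.87, 1116.6, 0, 0, 0,
             6884.03, 2235.97, 3618.04, 11745.42, 345.98],
    "colD": [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 0],
})

# Rows that get the special handling: colA blank and colD in {2, 3, 4}
eligible = df["colA"].isna() & df["colD"].isin([2, 3, 4])

# Remove eligible rows whose colC is zero
out = df.loc[~(eligible & df["colC"].eq(0))].copy()

# After the removal, the last remaining row of each id is its last non-zero row
# (assumption: rows are already sorted by colB within each id)
is_last = ~out["id"].duplicated(keep="last")

# Flag 1 only on the last row of eligible groups, 0 everywhere else
out["flag"] = (is_last & eligible.loc[out.index]).astype(int)
print(out)
```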
Answered By: dmb

This is what I came up with using the apply method. I think it does what you are asking for:

df['flag'] = df['colD'].shift(-1) #use as a placeholder to compare consecutive 'colD' vals
df['flag'] = df.apply(lambda x: 1 if (x['flag']!=x['colD']) & 
                  (np.isnan(x['colA'])) & (x['colD']>0) else 0, axis=1) 

Please let me know if that works! (You’ll need to have numpy as np imported btw). Also, if you want to limit this to only cases of 2,3 & 4, you’ll have to change the last part from (x['colD']>0) to be (x['colD']>1) & (x['colD'] < 5)
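
The same comparison can also be written without apply. This is a sketch under two assumptions that are not part of the answer above: the zero rows are filtered out first (otherwise the flag for id 1 would land on its last zero row), and neighbouring ids never share a colD value. The names kept, nxt and flag are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    "colA": [np.nan] * 10 + [2013],
    "colB": [10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 1],
    "colC": [1352.23, 706.87, 1116.6, 0, 0, 0,
             6884.03, 2235.97, 3618.04, 11745.42, 345.98],
    "colD": [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 0],
})

# Drop rows where colA is blank and colC is zero before comparing neighbours
kept = df.loc[~(df["colA"].isna() & df["colC"].eq(0))].copy()

# colD of the following row; NaN for the very last row
nxt = kept["colD"].shift(-1)

# A row is flagged when colD changes on the next row, colA is blank,
# and colD is between 2 and 4 (inclusive on both ends)
flag = (kept["colD"].ne(nxt) & kept["colA"].isna() & kept["colD"].between(2, 4)).astype(int)
kept["flag"] = flag
print(kept)
```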

Answered By: Greg Friedman

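import numpy as np
import pandas as pd

def function1(dd: pd.DataFrame):
    # Drop rows where colA is blank and colC is 0
    dd1 = dd.loc[~(dd.colA.isna() & dd.colC.eq(0))]
    # Flag only groups where colA is blank throughout and colD is 2, 3 or 4
    if dd1['colA'].isna().all() and dd1['colD'].isin([2, 3, 4]).all():
        # Index of the last remaining row with non-zero colC
        idx1 = dd1.query("colC != 0").tail(1).index.tolist()
        return dd1.assign(flag=np.where(dd1.index.isin(idx1), 1, 0))
    return dd1.assign(flag=0)

# group_keys=False keeps the original row index instead of adding an id level
df.groupby('id', group_keys=False).apply(function1)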

out:

    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2     0
1    1   NaN    11    706.87     2     0
2    1   NaN    12   1116.60     2     1
6    2   NaN     2   6884.03     3     0
7    2   NaN     3   2235.97     3     0
8    2   NaN     4   3618.04     3     0
9    2   NaN     5  11745.42     3     1
10   3  2013     1    345.98     0     0
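
As a side note on the error in the question, here is a small sketch of what is probably going on (the three-row frame and the helper inspect are made up for illustration). groupby(...).apply passes each group to the function as a DataFrame, not as a groupby object. Looping over a DataFrame yields its column labels, so `for name, group in grouped` tries to unpack the string 'id' into two names, and `group` ends up being the string 'd'.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "colC": [1.0, 0.0, 2.0]})

seen = []

def inspect(group):
    # Record the type of object that apply hands to the function
    seen.append(type(group).__name__)
    return group

df.groupby("id", group_keys=False).apply(inspect)
print(set(seen))  # each group arrives as a DataFrame

# Iterating a DataFrame yields column labels, which are plain strings here
labels = list(df)

# Unpacking the two-character label 'id' splits it into single characters
name, group = labels[0]
print(name, group)           # the two halves of the label
print(hasattr(group, "id"))  # a str has no attribute 'id'
```

This is why a function written to take a single group DataFrame, like function1 above, works with apply, while one that loops over (name, group) pairs does not.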
Answered By: G.G