Pandas complex processing with groupby
Question:
My data is grouped by id. In each group, it is sorted by colB. The logic I need to implement is as follows:
If colA is blank and colD is one of 2, 3, or 4,
then create a column called ‘flag’ and set flag = 1 in the last row of the group where colC is non-zero. Set the flag to 0 in all the other rows of that group where colC is non-zero.
Remove the rows where (colA is blank, and colC is 0) for that particular grouping.
Repeat above procedure for all other ‘id’ groups.
(For rows where colA is non-blank, I can set the flag to what I need.)
Here is the data I have:
id  colA  colB      colC  colD
 1          10   1352.23     2
 1          11    706.87     2
 1          12    1116.6     2
 1          13         0     2
 1          14         0     2
 1          15         0     2
 2           2   6884.03     3
 2           3   2235.97     3
 2           4   3618.04     3
 2           5  11745.42     3
 3  2013     1    345.98     0
and here is what I would like to get after processing it.
id  colA  colB      colC  colD  flag
 1          10   1352.23     2     0
 1          11    706.87     2     0
 1          12    1116.6     2     1
 2           2   6884.03     3     0
 2           3   2235.97     3     0
 2           4   3618.04     3     0
 2           5  11745.42     3     1
 3  2013     1    345.98     0     0
The data contains many thousands of such groupings. I am hoping someone can help me in figuring out what the Python code to do the above processing would look like. I have a basic familiarity with the groupby function, but not to the extent to be able to figure out how to do the above.
Here is the code I am trying to use. The code gives an error:
"AttributeError: 'str' object has no attribute 'id'"
I am trying to set the “flag” to NaN when I detect the zeros in colC that I eventually want to remove, so I can drop them easily, in a later step.
def setFlag(grouped):
    for name, group in grouped:
        for i in range(group.id.size):
            drop_candidate = (
                pd.isnull(group.iloc[i]['colA']) &
                ((group.iloc[i]['colD'] == 2) |
                 (group.iloc[i]['colD'] == 3) |
                 (group.iloc[i]['colD'] == 4))
            )
            last_nonZero = group[group != 0].index[-1]
            if (drop_candidate & (group.iloc[i]['colC'] == 0)):
                group['flag'] = np.nan
            elif ((drop_candidate & (group.iloc[i]['colC'] != 0)) & (last_nonZero != i)):
                group['flag'] = 0
            elif last_nonZero == i:
                group['flag'] = 1
    return grouped

df.groupby('id').apply(setFlag)
Here is the code to re-create the test dataframe:
import pandas as pd
import numpy as np

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order.
df = pd.DataFrame({
    'id': [1,1,1,1,1,1,2,2,2,2,3],
    'colA': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2013],
    'colB': [10,11,12,13,14,15,2,3,4,5,1],
    'colC': [1352.23,706.87,1116.6,0,0,0,6884.03,2235.97,3618.04,11745.42,345.98],
    'colD': [2,2,2,2,2,2,3,3,3,3,0],
    'flag': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
})
Answers:
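It looks like there are three parts to your process:
1) Get rid of rows where colA is null and colC == 0, so work on reducing your dataframe first. Dropping the rows where both conditions hold is the same as keeping the rows where colA is not null OR colC is non-zero:
reduced_df = df.loc[(df.colA.notnull()) | (df.colC != 0), :].copy()
If you instead want to keep only the rows where both hold (AND logic):
reduced_df = df.loc[(df.colA.notnull()) & (df.colC != 0), :].copy()
The OR version gives: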
    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2   NaN
1    1   NaN    11    706.87     2   NaN
2    1   NaN    12   1116.60     2   NaN
6    2   NaN     2   6884.03     3   NaN
7    2   NaN     3   2235.97     3   NaN
8    2   NaN     4   3618.04     3   NaN
9    2   NaN     5  11745.42     3   NaN
10   3  2013     1    345.98     0   NaN
2) Now you are ready to work on part two, which is flagging the last row of a group. Since the default flag value is 0, start with that:
reduced_df['flag'] = 0
3) You can find the last row for each colD value using duplicated with keep='last', and then make sure colA is null. This keys on colD, which works here because each id in the sample has its own colD value:
reduced_df.loc[~reduced_df.colD.duplicated(keep='last') & reduced_df.colA.isnull(), 'flag'] = 1
reduced_df
    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2     0
1    1   NaN    11    706.87     2     0
2    1   NaN    12   1116.60     2     1
6    2   NaN     2   6884.03     3     0
7    2   NaN     3   2235.97     3     0
8    2   NaN     4   3618.04     3     0
9    2   NaN     5  11745.42     3     1
10   3  2013     1    345.98     0     0
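If several id groups could share the same colD value, keying on colD would flag only one of them. A minimal sketch of the same three steps keyed on id instead, assuming the rows are already sorted by colB within each id as the question states (the names `candidates` and `last_idx` are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    'colA': [np.nan] * 10 + [2013],
    'colB': [10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 1],
    'colC': [1352.23, 706.87, 1116.6, 0, 0, 0,
             6884.03, 2235.97, 3618.04, 11745.42, 345.98],
    'colD': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 0],
})

# 1) Drop rows where colA is blank and colC is 0.
reduced_df = df.loc[~(df['colA'].isna() & df['colC'].eq(0))].copy()

# 2) Rows that may receive flag = 1: blank colA, colD in {2, 3, 4}, non-zero colC.
candidates = reduced_df[
    reduced_df['colA'].isna()
    & reduced_df['colD'].isin([2, 3, 4])
    & reduced_df['colC'].ne(0)
]

# 3) Index label of the last candidate row in each id group.
last_idx = candidates.groupby('id').tail(1).index

# Everything defaults to 0; only the last candidate per id becomes 1.
reduced_df['flag'] = reduced_df.index.isin(last_idx).astype(int)
print(reduced_df)
```

On the sample data this marks index 2 and index 9, the same rows as the colD-based version.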
This is what I came up with using the apply method. I think it does what you are asking for:
df['flag'] = df['colD'].shift(-1)  # use as a placeholder to compare consecutive 'colD' values
df['flag'] = df.apply(lambda x: 1 if (x['flag'] != x['colD']) &
                      (np.isnan(x['colA'])) & (x['colD'] > 0) else 0, axis=1)
Please let me know if that works! (You’ll need to have numpy imported as np, by the way.) Note that this flags the last row of each run of equal colD values, so the rows with colC == 0 need to be dropped first. Also, if you want to limit this to only the cases 2, 3 and 4, you’ll have to change the last part from (x['colD'] > 0)
to (x['colD'] > 1) & (x['colD'] < 5).
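The same shift comparison can probably be done without a row-wise apply. A rough sketch, assuming the colC == 0 rows are dropped first and each id occupies one contiguous run of equal colD values (the names `kept` and `next_colD` are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    'colA': [np.nan] * 10 + [2013],
    'colB': [10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 1],
    'colC': [1352.23, 706.87, 1116.6, 0, 0, 0,
             6884.03, 2235.97, 3618.04, 11745.42, 345.98],
    'colD': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 0],
})

# Drop rows where colA is blank and colC is 0 before comparing neighbours.
kept = df.loc[~(df['colA'].isna() & df['colC'].eq(0))].copy()

# colD of the following row; the final row gets NaN, which never equals colD.
next_colD = kept['colD'].shift(-1)

# flag = 1 where the run of colD ends, colA is blank, and colD is 2, 3 or 4.
kept['flag'] = (
    next_colD.ne(kept['colD'])
    & kept['colA'].isna()
    & kept['colD'].between(2, 4)
).astype(int)
print(kept)
```

Working on whole columns avoids calling a Python lambda once per row, which tends to matter with many thousands of groups.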
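def function1(dd: pd.DataFrame):
    # drop rows where colA is blank and colC is 0
    dd1 = dd.loc[~(pd.isna(dd.colA) & dd.colC.eq(0))]
    if dd1['colA'].isna().all() & dd1['colD'].isin([2, 3, 4]).all():
        # index of the last row with non-zero colC
        idx1 = dd1.query("colC != 0").tail(1).index.tolist()
        return dd1.assign(flag=np.where(dd1.index.isin(idx1), 1, 0))
    return dd1.assign(flag=0)

# group_keys=False keeps the original index instead of adding an 'id' level
df.groupby('id', group_keys=False).apply(function1)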
out:
    id  colA  colB      colC  colD  flag
0    1   NaN    10   1352.23     2     0
1    1   NaN    11    706.87     2     0
2    1   NaN    12   1116.60     2     1
6    2   NaN     2   6884.03     3     0
7    2   NaN     3   2235.97     3     0
8    2   NaN     4   3618.04     3     0
9    2   NaN     5  11745.42     3     1
10   3  2013     1    345.98     0     0
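As for the AttributeError in the question: groupby(...).apply calls the function once per group and passes each group in as a DataFrame, so the `for name, group in grouped` loop inside setFlag iterates over column labels rather than over (name, group) pairs. A small sketch of what likely happens, using a cut-down two-column frame as a stand-in for one group:

```python
import pandas as pd

# Stand-in for a single group as apply would pass it to setFlag.
group = pd.DataFrame({'id': [1, 1], 'colC': [1352.23, 0.0]})

# Iterating a DataFrame yields its column labels, not rows or sub-frames.
labels = list(group)

# Unpacking the first label, the string 'id', gives the characters 'i' and 'd'.
name, inner = labels[0]

try:
    inner.id  # same attribute access as group.id.size in setFlag
except AttributeError as exc:
    message = str(exc)

print(labels)
print(message)

# Iterating the GroupBy object itself is what yields (key, sub-frame) pairs.
df2 = pd.DataFrame({'id': [1, 1, 2], 'colC': [1352.23, 0.0, 6884.03]})
keys = [key for key, sub in df2.groupby('id')]
print(keys)
```

So the loop over groups belongs outside apply, or the function passed to apply should treat its argument as one group.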