Pandas Groupby Operation For Condition Based Feature Creation

Question:

Having difficulty creating a feature based on some groupby + conditions

The data I have looks similar to:

    ir_id   pli pli_missing err_type
0   name1   1.0          no  UNKNOWN
1   name1   2.0          no      NaN
2   name1   3.0          no      NaN
3   name1   NaN         yes  UNKNOWN
4   name2   4.0          no      NaN
5   name2   5.0          no      NaN
6   name2   NaN         yes  UNKNOWN
7   name3   6.0          no      NaN
8   name3   7.0          no      NaN
9   name3   8.0          no      NaN
10  name3   9.0          no  UNKNOWN
11  name4  10.0          no      NaN
12  name4  11.0          no      NaN
13  name4  12.0          no      NaN
14  name5   NaN         yes  UNKNOWN
15  name5   NaN         yes  UNKNOWN
16  name5   NaN         yes  UNKNOWN
17  name5   NaN         yes  UNKNOWN

I want to group by ir_id and create an err_flag column as follows:

  • type1: at least 1 row in the group has the value "UNKNOWN" in the err_type column together with "yes" in pli_missing

    ir_id  pli pli_missing err_type err_flag
4   name2  4.0          no      NaN    type1
5   name2  5.0          no      NaN    type1
6   name2  NaN         yes  UNKNOWN    type1

    ir_id  pli pli_missing err_type err_flag
14  name5  NaN         yes  UNKNOWN    type1
15  name5  NaN         yes  UNKNOWN    type1
16  name5  NaN         yes  UNKNOWN    type1
17  name5  NaN         yes  UNKNOWN    type1
  • type2: at least 1 row in the group has the value "UNKNOWN" in the err_type column together with "no" in pli_missing

    ir_id  pli pli_missing err_type err_flag
7   name3  6.0          no      NaN    type2
8   name3  7.0          no      NaN    type2
9   name3  8.0          no      NaN    type2
10  name3  9.0          no  UNKNOWN    type2
  • type3: no row in the group has the value "UNKNOWN" in the err_type column (and pli_missing is "no" throughout)

    ir_id   pli pli_missing err_type err_flag
11  name4  10.0          no      NaN    type3
12  name4  11.0          no      NaN    type3
13  name4  12.0          no      NaN    type3
  • both_type: the group satisfies both the type1 and type2 conditions, i.e.

    ir_id  pli pli_missing err_type   err_flag
0   name1  1.0          no  UNKNOWN  both_type
1   name1  2.0          no      NaN  both_type
2   name1  3.0          no      NaN  both_type
3   name1  NaN         yes  UNKNOWN  both_type

Which results in the final output:

    ir_id   pli pli_missing err_type   err_flag
0   name1   1.0          no  UNKNOWN  both_type
1   name1   2.0          no      NaN  both_type
2   name1   3.0          no      NaN  both_type
3   name1   NaN         yes  UNKNOWN  both_type
4   name2   4.0          no      NaN      type1
5   name2   5.0          no      NaN      type1
6   name2   NaN         yes  UNKNOWN      type1
7   name3   6.0          no      NaN      type2
8   name3   7.0          no      NaN      type2
9   name3   8.0          no      NaN      type2
10  name3   9.0          no  UNKNOWN      type2
11  name4  10.0          no      NaN      type3
12  name4  11.0          no      NaN      type3
13  name4  12.0          no      NaN      type3
14  name5   NaN         yes  UNKNOWN      type1
15  name5   NaN         yes  UNKNOWN      type1
16  name5   NaN         yes  UNKNOWN      type1
17  name5   NaN         yes  UNKNOWN      type1

dataset used:

import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id':['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4', 'name5', 'name5', 'name5', 'name5']
    , 'pli': [1,       2,       3,       np.nan,  4,       5,       np.nan,  6,       7,       8,       9,       10,      11,      12,      np.nan,  np.nan,  np.nan,  np.nan]
    , 'pli_missing': ["no","no","no","yes","no","no","yes","no","no","no","no","no","no","no","yes","yes","yes","yes"]
    , 'err_type': ["UNKNOWN",np.nan,np.nan,"UNKNOWN",np.nan,np.nan,"UNKNOWN",np.nan,np.nan,np.nan,"UNKNOWN",np.nan,np.nan,np.nan,"UNKNOWN","UNKNOWN","UNKNOWN","UNKNOWN"]
    , 'err_flag': ["both_type", "both_type", "both_type", "both_type", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type3", "type3", "type3", "type1", "type1", "type1", "type1"]
})

custom_df

PS
An earlier proposed solution couldn’t handle the case where ir_id = name5

Asked By: fast_crawler


Answers:

I’ve found a solution. I trust someone else could find a better one, but it seems to arrive at the requested result correctly.

import pandas as pd
import numpy as np

df = custom_df.copy()

condlist = [((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes')),
            ((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no'))]

choicelist = ['type1','type2']

df['err_flag'] = np.select(condlist, choicelist, default='type3')

s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
df.loc[df['ir_id'].isin(s[s].index),'err_flag'] = 'both_type'

df['err_flag'] = (df.sort_values(['ir_id','err_flag'])
                    .groupby('ir_id')['err_flag']
                    .transform('first'))

print(df[['ir_id','err_flag']].drop_duplicates())

    ir_id   err_flag
0   name1  both_type
4   name2      type1
7   name3      type2
11  name4      type3
14  name5      type1

print(df.equals(custom_df))
# True

Explanation steps:

  • We start with np.select. In the condlist we store the conditions for type1 and type2, we let type3 be the default (i.e. neither condition is met), and assign the result to a new column, err_flag.
  • Next, we select from the df only the rows that do not contain type3 in the new column and use df.groupby and then .nunique to get a count of the unique values for each group (minus a potential type3, of course).
  • Now, any of our filtered groups can only have 2 unique values (type1 and type2) or just 1. So, we want to filter using .gt(1) to get a boolean Series that will have True only for the groups that have both. These are the ones we want to overwrite with both_type.
  • To overwrite the groups mentioned, we filter the df again based on the aforementioned Series and assign both_type.
  • In the final step, we sort the df on ['ir_id','err_flag'] and group by ir_id. Now we can ask for the first value from each group and assign this value to the entire group (hence the use of .transform), thus overwriting any type3 values that are left (both type1 and type2 come first because of the sort).
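The final step's broadcast behavior can be seen on a tiny standalone example (toy data, not the question's frame):

```python
import pandas as pd

# Toy frame: group 'a' has mixed labels, group 'b' only 'type3'
toy = pd.DataFrame({
    'ir_id':    ['a', 'a', 'b', 'b'],
    'err_flag': ['type3', 'type1', 'type3', 'type3'],
})

# Sorting puts 'type1' before 'type3' within group 'a'; transform('first')
# then broadcasts that first (sorted) value back to every row of its group,
# and assignment re-aligns the result to the original index.
toy['err_flag'] = (toy.sort_values(['ir_id', 'err_flag'])
                      .groupby('ir_id')['err_flag']
                      .transform('first'))

print(toy['err_flag'].tolist())  # ['type1', 'type1', 'type3', 'type3']
```

Because `transform` returns a Series aligned by index, the sort only affects which value is "first" per group, not where the result lands.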
Answered By: ouroboros1

I think we could apply a sort of categorical logic here. Here’s what I mean.

Let’s say ['type3','type1','type2','both_types'] are verbal labels for the codes [0, 1, 2, 3]. Why this order? As I see it, type3 is a sort of default value. type1 and type2 are equal by nature, but we need some way to distinguish them, so let’s follow their names and say that type1 is 1 and type2 is 2. And both_types has code 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or 2, and take the sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum will be 0. If only one of them applies, the sum will keep it as is. If a record can be of both types, the sum will show 3.

Let’s see how it looks in code:

err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

# with transform get the same dimension as custom_df
type1 = (err & pli).groupby(grouper).transform(any)
type2 = (err & ~pli).groupby(grouper).transform(any)

codes = type1 + 2*type2
categories = ['type3','type1','type2','both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)

Here’s what I’ve got in the end:

[screenshot of the resulting DataFrame with the new err_flag_new column]
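Since the screenshot isn't reproduced here, this self-contained sketch rebuilds the question's dataset and runs the snippet above, printing the per-group flags (note this answer spells the combined label both_types, where the question used both_type):

```python
import numpy as np
import pandas as pd

# Rebuild the question's dataset (the expected err_flag column is omitted)
custom_df = pd.DataFrame({
    'ir_id': ['name1']*4 + ['name2']*3 + ['name3']*4 + ['name4']*3 + ['name5']*4,
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12,
            np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ['no']*3 + ['yes'] + ['no']*2 + ['yes'] + ['no']*7 + ['yes']*4,
    'err_type': ['UNKNOWN', np.nan, np.nan, 'UNKNOWN', np.nan, np.nan,
                 'UNKNOWN', np.nan, np.nan, np.nan, 'UNKNOWN',
                 np.nan, np.nan, np.nan] + ['UNKNOWN']*4,
})

err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

# 'any' is the string form of the reducer used in the answer above
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')

codes = type1 + 2 * type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)

summary = custom_df[['ir_id', 'err_flag_new']].drop_duplicates()
print(summary)
#     ir_id err_flag_new
# 0   name1   both_types
# 4   name2        type1
# 7   name3        type2
# 11  name4        type3
# 14  name5        type1
```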


Update

We can look at it this way. Suppose there’s a system with some number of independent states. The system can be described as having any combination of them, which can be expressed mathematically as a binary code. For each independent state we assign a unique place in this code, where the values 1 or 0 answer whether the system has the corresponding state. Binary means a sum like this one:

state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...

In our case we have only 2 independent states, type1 and type2. The other two are their combinations: type3 means neither of them, and both_types speaks for itself. So we have only the first two terms of the sum above, where state[0] is the logical value for type1 and state[1] is the logical value for type2. That’s why I used codes = type1 + 2*type2, which is equal to codes = type1 * 2**0 + type2 * 2**1

As for the order in ['type3','type1','type2','both_types'], the index of each value in this list matches its code, i.e. each type has the index equal to its binary code. The binary code for type3 in this model is 0b00, which is zero; for type1 it is 0b01, which is one; for type2 it is 0b10, which is 2; and for both_types it’s 0b11, which is equal to 3. These codes are automatically mapped when creating a Categorical sequence from codes, i.e. pandas uses the codes as indexes to get the corresponding values from the list and places them instead of the codes.
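A tiny standalone illustration of this encoding (toy booleans, not the question's data):

```python
import numpy as np

# Two independent boolean "states" per record
state0 = np.array([False, True, False, True])   # bit 0, weight 2**0 (type1)
state1 = np.array([False, False, True, True])   # bit 1, weight 2**1 (type2)

# The binary sum collapses each pair of booleans into a single code 0..3
codes = state0 * 2**0 + state1 * 2**1
print(codes.tolist())  # [0, 1, 2, 3]
```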

See also enum.IntFlag as an abstraction of this idea, and the flags in the re (regular expression) module as an example of how it can be used.
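The enum.IntFlag parallel might look like this (the class and member names are illustrative, not from the answer above):

```python
from enum import IntFlag

class ErrFlag(IntFlag):
    TYPE3 = 0                    # neither condition holds (the "no flags" value)
    TYPE1 = 1                    # bit 0: "UNKNOWN" with pli_missing == "yes"
    TYPE2 = 2                    # bit 1: "UNKNOWN" with pli_missing == "no"
    BOTH_TYPES = TYPE1 | TYPE2   # 3: both bits set

# Combining the two independent flags yields the named composite member
flag = ErrFlag.TYPE1 | ErrFlag.TYPE2
print(int(flag))                     # 3
print(flag == ErrFlag.BOTH_TYPES)    # True
```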

Answered By: Vitalizzare