Pandas Groupby Operation For Condition Based Feature Creation
Question:
Having difficulties creating a feature based on a groupby + conditions.
The data I’ve got looks similar to:
|    | ir_id | pli  | pli_missing | err_type |
|----|-------|------|-------------|----------|
| 0  | name1 | 1.0  | no          | UNKNOWN  |
| 1  | name1 | 2.0  | no          | NaN      |
| 2  | name1 | 3.0  | no          | NaN      |
| 3  | name1 | NaN  | yes         | UNKNOWN  |
| 4  | name2 | 4.0  | no          | NaN      |
| 5  | name2 | 5.0  | no          | NaN      |
| 6  | name2 | NaN  | yes         | UNKNOWN  |
| 7  | name3 | 6.0  | no          | NaN      |
| 8  | name3 | 7.0  | no          | NaN      |
| 9  | name3 | 8.0  | no          | NaN      |
| 10 | name3 | 9.0  | no          | UNKNOWN  |
| 11 | name4 | 10.0 | no          | NaN      |
| 12 | name4 | 11.0 | no          | NaN      |
| 13 | name4 | 12.0 | no          | NaN      |
| 14 | name5 | NaN  | yes         | UNKNOWN  |
| 15 | name5 | NaN  | yes         | UNKNOWN  |
| 16 | name5 | NaN  | yes         | UNKNOWN  |
| 17 | name5 | NaN  | yes         | UNKNOWN  |
I want to group by `ir_id` such that I can create an `err_flag` column, which is:

- type1: at least 1 row having the value `"UNKNOWN"` in the `err_type` column, and also `"yes"` in `pli_missing`
|   | ir_id | pli | pli_missing | err_type | err_flag |
|---|-------|-----|-------------|----------|----------|
| 4 | name2 | 4.0 | no          | NaN      | type1    |
| 5 | name2 | 5.0 | no          | NaN      | type1    |
| 6 | name2 | NaN | yes         | UNKNOWN  | type1    |

|    | ir_id | pli | pli_missing | err_type | err_flag |
|----|-------|-----|-------------|----------|----------|
| 14 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 15 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 16 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 17 | name5 | NaN | yes         | UNKNOWN  | type1    |
- type2: at least 1 row having the value `"UNKNOWN"` in the `err_type` column, and also `"no"` in `pli_missing`
|    | ir_id | pli | pli_missing | err_type | err_flag |
|----|-------|-----|-------------|----------|----------|
| 7  | name3 | 6.0 | no          | NaN      | type2    |
| 8  | name3 | 7.0 | no          | NaN      | type2    |
| 9  | name3 | 8.0 | no          | NaN      | type2    |
| 10 | name3 | 9.0 | no          | UNKNOWN  | type2    |
- type3: no row having the value `"UNKNOWN"` in the `err_type` column, and also `"no"` in `pli_missing`
|    | ir_id | pli  | pli_missing | err_type | err_flag |
|----|-------|------|-------------|----------|----------|
| 11 | name4 | 10.0 | no          | NaN      | type3    |
| 12 | name4 | 11.0 | no          | NaN      | type3    |
| 13 | name4 | 12.0 | no          | NaN      | type3    |
- both_type: both the type1 and type2 error flags, i.e.
|   | ir_id | pli | pli_missing | err_type | err_flag  |
|---|-------|-----|-------------|----------|-----------|
| 0 | name1 | 1.0 | no          | UNKNOWN  | both_type |
| 1 | name1 | 2.0 | no          | NaN      | both_type |
| 2 | name1 | 3.0 | no          | NaN      | both_type |
| 3 | name1 | NaN | yes         | UNKNOWN  | both_type |
Which results in the final output:
|    | ir_id | pli  | pli_missing | err_type | err_flag  |
|----|-------|------|-------------|----------|-----------|
| 0  | name1 | 1.0  | no          | UNKNOWN  | both_type |
| 1  | name1 | 2.0  | no          | NaN      | both_type |
| 2  | name1 | 3.0  | no          | NaN      | both_type |
| 3  | name1 | NaN  | yes         | UNKNOWN  | both_type |
| 4  | name2 | 4.0  | no          | NaN      | type1     |
| 5  | name2 | 5.0  | no          | NaN      | type1     |
| 6  | name2 | NaN  | yes         | UNKNOWN  | type1     |
| 7  | name3 | 6.0  | no          | NaN      | type2     |
| 8  | name3 | 7.0  | no          | NaN      | type2     |
| 9  | name3 | 8.0  | no          | NaN      | type2     |
| 10 | name3 | 9.0  | no          | UNKNOWN  | type2     |
| 11 | name4 | 10.0 | no          | NaN      | type3     |
| 12 | name4 | 11.0 | no          | NaN      | type3     |
| 13 | name4 | 12.0 | no          | NaN      | type3     |
| 14 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 15 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 16 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 17 | name5 | NaN  | yes         | UNKNOWN  | type1     |
Dataset used:

import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id': ['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4', 'name5', 'name5', 'name5', 'name5'],
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12, np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ["no", "no", "no", "yes", "no", "no", "yes", "no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    'err_type': ["UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    # err_flag holds the expected output; the solutions below are checked against it
    'err_flag': ["both_type", "both_type", "both_type", "both_type", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type3", "type3", "type3", "type1", "type1", "type1", "type1"],
})
custom_df
PS: an earlier solution couldn’t handle the cases for `ir_id = name5`.
Answers:
I’ve found a solution. I trust someone else could find a better one, but it seems to arrive at the requested result correctly.
import pandas as pd
import numpy as np

df = custom_df.copy()

# Row-level classification: type1/type2 where the conditions hold, type3 otherwise.
condlist = [(df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes'),
            (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no')]
choicelist = ['type1', 'type2']
df['err_flag'] = np.select(condlist, choicelist, default='type3')

# Groups whose non-type3 rows hold more than one distinct flag get both_type.
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
df.loc[df['ir_id'].isin(s[s].index), 'err_flag'] = 'both_type'

# Broadcast the first flag after sorting to every row of its group
# ('both_type' < 'type1' < 'type2' < 'type3' alphabetically).
df['err_flag'] = (df.sort_values(['ir_id', 'err_flag'])
                    .groupby('ir_id')['err_flag']
                    .transform('first'))

print(df[['ir_id', 'err_flag']].drop_duplicates())
ir_id err_flag
0 name1 both_type
4 name2 type1
7 name3 type2
11 name4 type3
14 name5 type1
print(df.equals(custom_df))
# True
Explanation steps:

- We start with `np.select`. In the `condlist` we store the conditions for `type1` and `type2`, we let `type3` be the default (i.e. neither condition is met), and we assign the result to a new column, `err_flag`.
- Next, we select from the `df` only the rows that do not contain `type3` in the new column and use `df.groupby` and then `.nunique` to get a count of the unique values in each group (minus a potential `type3`, of course; see the sketch after this list).
- Now, any of our filtered groups can only have 2 unique values (`type1` and `type2`) or just 1. So we filter using `.gt(1)` to get a boolean `Series` that is `True` only for the groups that have both. These are the ones we want to overwrite with `both_type`.
- To overwrite those groups, we filter the `df` again based on the aforementioned `Series` and assign `both_type`.
- In the final step, we sort the `df` on `['ir_id','err_flag']` and group by `ir_id`. Now we can ask for the `first` value in each group and assign it to the entire group (hence the use of `.transform`), thus overwriting any `type3` values that are left (both `type1` and `type2` sort before `type3`, so they come first).
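
To make steps 2 and 3 concrete, here is a minimal sketch of the intermediate objects for the sample data (it assumes `df['err_flag']` was just created by `np.select`, before the overwrite steps):

# Distinct non-type3 flags per group (name4 drops out: it is all type3).
per_group = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique()
print(per_group)
# ir_id
# name1    2    <- the only group with both a type1 and a type2 row
# name2    1
# name3    1
# name5    1

s = per_group.gt(1)  # True only where a group mixes type1 and type2
print(s[s].index)    # Index(['name1'], dtype='object', name='ir_id')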
I think we could apply a sort of categorical logic here. Here’s what I mean.

Let’s say `['type3','type1','type2','both_types']` are verbal representations of the codes `[0, 1, 2, 3]`. Why this order? As I see it, `type3` is sort of a default value. `type1` and `type2` are equal by nature, but we somehow have to tell them apart. So let’s follow their names and say that `type1` is 1 and `type2` is 2. And `both_types` has index 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or type 2, and take their sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum is 0. If only one of them applies, the sum keeps it as is. If a record can be of both types, the sum is 3.
Let’s see how it looks in code:
err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

# with transform, each group's result is broadcast back to the shape of custom_df
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')

codes = type1 + 2*type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
Here’s what I’ve got in the end: `err_flag_new` reproduces the expected `err_flag` for every row, except that this answer spells the combined flag `both_types` rather than `both_type`.
**Update**
We can look at it this way. Suppose there’s a system with some number of independent states, and the system can be described as having any combination of them. Mathematically, that is a binary code: for each independent state we assign a unique place in the code, where 1 or 0 answers whether the system has the corresponding state. Binary means a sum like this one:

state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...
In our case we have only 2 independent states, `type1` and `type2`. The other two are their combinations: `type3` means neither of them, and `both_types` speaks for itself. So we have only the first two terms of the sum above, where `state[0]` is the logical value for `type1` and `state[1]` is the logical value for `type2`. That’s why I used `codes = type1 + 2*type2`, which is equal to `codes = type1 * 2**0 + type2 * 2**1`.
As for the order in `['type3','type1','type2','both_types']`: in this list the index of each value matches its code, i.e. each type has the index equal to its binary code. In this model the binary code for `type3` is `0b00`, which is zero; for `type1` it is `0b01`, which is one; for `type2` it is `0b10`, which is 2; and for `both_types` it is `0b11`, which equals 3. These codes are matched automatically when creating a `Categorical` sequence with `from_codes`, i.e. pandas uses the codes as indexes to pick the corresponding values from the list and places them instead of the codes.
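
A minimal sketch of that mapping in isolation (toy inputs, independent of `custom_df`):

import pandas as pd

# every combination of the two independent states
type1 = pd.Series([False, True, False, True])
type2 = pd.Series([False, False, True, True])

codes = type1 + 2 * type2  # 0, 1, 2, 3
labels = ['type3', 'type1', 'type2', 'both_types']
print(pd.Categorical.from_codes(codes, labels))
# ['type3', 'type1', 'type2', 'both_types']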
See also `enum.IntFlag` as an abstract realization of this idea, and the flags in the regular expression module (`re`) as an example of how it can be used.
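
For illustration, the same four flags could be modeled with `enum.IntFlag` along these lines (a sketch; the enum and its member names are hypothetical, not part of the answer above):

from enum import IntFlag

class ErrFlag(IntFlag):
    TYPE3 = 0                   # hypothetical names; 0 is the empty/default flag
    TYPE1 = 1                   # bit 0: UNKNOWN err_type with pli_missing == 'yes'
    TYPE2 = 2                   # bit 1: UNKNOWN err_type with pli_missing == 'no'
    BOTH_TYPES = TYPE1 | TYPE2  # 3: both bits set

print(ErrFlag.TYPE1 | ErrFlag.TYPE2)  # ErrFlag.BOTH_TYPES
print(ErrFlag(0))                     # ErrFlag.TYPE3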