Pandas Groupby Operation For Condition Based Feature Creation
Question:
Having difficulties creating a feature based on a groupby + conditions.
The data I’ve got looks similar to:
|    | ir_id | pli  | pli_missing | err_type |
|----|-------|------|-------------|----------|
| 0  | name1 | 1.0  | no          | UNKNOWN  |
| 1  | name1 | 2.0  | no          | NaN      |
| 2  | name1 | 3.0  | no          | NaN      |
| 3  | name1 | NaN  | yes         | UNKNOWN  |
| 4  | name2 | 4.0  | no          | NaN      |
| 5  | name2 | 5.0  | no          | NaN      |
| 6  | name2 | NaN  | yes         | UNKNOWN  |
| 7  | name3 | 6.0  | no          | NaN      |
| 8  | name3 | 7.0  | no          | NaN      |
| 9  | name3 | 8.0  | no          | NaN      |
| 10 | name3 | 9.0  | no          | UNKNOWN  |
| 11 | name4 | 10.0 | no          | NaN      |
| 12 | name4 | 11.0 | no          | NaN      |
| 13 | name4 | 12.0 | no          | NaN      |
| 14 | name5 | NaN  | yes         | UNKNOWN  |
| 15 | name5 | NaN  | yes         | UNKNOWN  |
| 16 | name5 | NaN  | yes         | UNKNOWN  |
| 17 | name5 | NaN  | yes         | UNKNOWN  |
I want to group by `ir_id` such that I can create an `err_flag` column, which is:

- type1: at least 1 row having the value `"UNKNOWN"` in the `err_type` column, and also `"yes"` in `pli_missing`
|   | ir_id | pli | pli_missing | err_type | err_flag |
|---|-------|-----|-------------|----------|----------|
| 4 | name2 | 4.0 | no          | NaN      | type1    |
| 5 | name2 | 5.0 | no          | NaN      | type1    |
| 6 | name2 | NaN | yes         | UNKNOWN  | type1    |

|    | ir_id | pli | pli_missing | err_type | err_flag |
|----|-------|-----|-------------|----------|----------|
| 14 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 15 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 16 | name5 | NaN | yes         | UNKNOWN  | type1    |
| 17 | name5 | NaN | yes         | UNKNOWN  | type1    |
- type2: at least 1 row having the value `"UNKNOWN"` in the `err_type` column, and also `"no"` in `pli_missing`
|    | ir_id | pli | pli_missing | err_type | err_flag |
|----|-------|-----|-------------|----------|----------|
| 7  | name3 | 6.0 | no          | NaN      | type2    |
| 8  | name3 | 7.0 | no          | NaN      | type2    |
| 9  | name3 | 8.0 | no          | NaN      | type2    |
| 10 | name3 | 9.0 | no          | UNKNOWN  | type2    |
- type3: no row having the value `"UNKNOWN"` in the `err_type` column, and also `"no"` in `pli_missing`
|    | ir_id | pli  | pli_missing | err_type | err_flag |
|----|-------|------|-------------|----------|----------|
| 11 | name4 | 10.0 | no          | NaN      | type3    |
| 12 | name4 | 11.0 | no          | NaN      | type3    |
| 13 | name4 | 12.0 | no          | NaN      | type3    |
- both_type: both the type1 and type2 error flags, i.e.
|   | ir_id | pli | pli_missing | err_type | err_flag  |
|---|-------|-----|-------------|----------|-----------|
| 0 | name1 | 1.0 | no          | UNKNOWN  | both_type |
| 1 | name1 | 2.0 | no          | NaN      | both_type |
| 2 | name1 | 3.0 | no          | NaN      | both_type |
| 3 | name1 | NaN | yes         | UNKNOWN  | both_type |
Which results in the final output:
|    | ir_id | pli  | pli_missing | err_type | err_flag  |
|----|-------|------|-------------|----------|-----------|
| 0  | name1 | 1.0  | no          | UNKNOWN  | both_type |
| 1  | name1 | 2.0  | no          | NaN      | both_type |
| 2  | name1 | 3.0  | no          | NaN      | both_type |
| 3  | name1 | NaN  | yes         | UNKNOWN  | both_type |
| 4  | name2 | 4.0  | no          | NaN      | type1     |
| 5  | name2 | 5.0  | no          | NaN      | type1     |
| 6  | name2 | NaN  | yes         | UNKNOWN  | type1     |
| 7  | name3 | 6.0  | no          | NaN      | type2     |
| 8  | name3 | 7.0  | no          | NaN      | type2     |
| 9  | name3 | 8.0  | no          | NaN      | type2     |
| 10 | name3 | 9.0  | no          | UNKNOWN  | type2     |
| 11 | name4 | 10.0 | no          | NaN      | type3     |
| 12 | name4 | 11.0 | no          | NaN      | type3     |
| 13 | name4 | 12.0 | no          | NaN      | type3     |
| 14 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 15 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 16 | name5 | NaN  | yes         | UNKNOWN  | type1     |
| 17 | name5 | NaN  | yes         | UNKNOWN  | type1     |
Dataset used:

import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id': ['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4', 'name5', 'name5', 'name5', 'name5'],
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12, np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ["no", "no", "no", "yes", "no", "no", "yes", "no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    'err_type': ["UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    # err_flag holds the expected output; the solutions below are checked against it
    'err_flag': ["both_type", "both_type", "both_type", "both_type", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type3", "type3", "type3", "type1", "type1", "type1", "type1"],
})
custom_df
PS: an earlier solution couldn’t handle the cases for `ir_id = name5`.
Answers:
I’ve found a solution. I trust someone else could find a better one, but it seems to arrive at the requested result correctly.
import pandas as pd
import numpy as np

df = custom_df.copy()

# Row-level classification: type1/type2 where the conditions hold, type3 otherwise.
condlist = [(df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes'),
            (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no')]
choicelist = ['type1', 'type2']
df['err_flag'] = np.select(condlist, choicelist, default='type3')

# Groups whose non-type3 rows hold more than one distinct flag get both_type.
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
df.loc[df['ir_id'].isin(s[s].index), 'err_flag'] = 'both_type'

# Broadcast the first flag after sorting to every row of its group
# ('both_type' < 'type1' < 'type2' < 'type3' alphabetically).
df['err_flag'] = (df.sort_values(['ir_id', 'err_flag'])
                    .groupby('ir_id')['err_flag']
                    .transform('first'))

print(df[['ir_id', 'err_flag']].drop_duplicates())
ir_id err_flag
0 name1 both_type
4 name2 type1
7 name3 type2
11 name4 type3
14 name5 type1
print(df.equals(custom_df))
# True
Explanation steps:

- We start with `np.select`. In the `condlist` we store the conditions for `type1` and `type2`, we let `type3` be the default (i.e. neither condition is met), and we assign the result to a new column, `err_flag`.
- Next, we select from the `df` only the rows that do not contain `type3` in the new column and use `df.groupby` and then `.nunique` to get a count of the unique values in each group (minus a potential `type3`, of course; see the sketch after this list).
- Now, any of our filtered groups can only have 2 unique values (`type1` and `type2`) or just 1. So we filter using `.gt(1)` to get a boolean `Series` that is `True` only for the groups that have both. These are the ones we want to overwrite with `both_type`.
- To overwrite those groups, we filter the `df` again based on the aforementioned `Series` and assign `both_type`.
- In the final step, we sort the `df` on `['ir_id','err_flag']` and group by `ir_id`. Now we can ask for the `first` value in each group and assign it to the entire group (hence the use of `.transform`), thus overwriting any `type3` values that are left (both `type1` and `type2` sort before `type3`, so they come first).
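
To make steps 2 and 3 concrete, here is a minimal sketch of the intermediate objects for the sample data (it assumes `df['err_flag']` was just created by `np.select`, before the overwrite steps):

# Distinct non-type3 flags per group (name4 drops out: it is all type3).
per_group = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique()
print(per_group)
# ir_id
# name1    2    <- the only group with both a type1 and a type2 row
# name2    1
# name3    1
# name5    1

s = per_group.gt(1)  # True only where a group mixes type1 and type2
print(s[s].index)    # Index(['name1'], dtype='object', name='ir_id')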
I think we could apply a sort of categorical logic here. Here’s what I mean.

Let’s say `['type3','type1','type2','both_types']` are verbal representations of the codes `[0, 1, 2, 3]`. Why this order? As I see it, `type3` is sort of a default value. `type1` and `type2` are equal by nature, but we somehow have to tell them apart. So let’s follow their names and say that `type1` is 1 and `type2` is 2. And `both_types` has index 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or type 2, and take their sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum is 0. If only one of them applies, the sum keeps it as is. If a record can be of both types, the sum is 3.
Let’s see how it looks in code:
err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

# with transform, each group's result is broadcast back to the shape of custom_df
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')

codes = type1 + 2*type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
Here’s what I’ve got in the end: `err_flag_new` reproduces the expected `err_flag` for every row, except that this answer spells the combined flag `both_types` rather than `both_type`.
**Update**
We can look at it this way. Suppose there’s a system with some number of independent states, and the system can be described as having any combination of them. Mathematically, that is a binary code: for each independent state we assign a unique place in the code, where 1 or 0 answers whether the system has the corresponding state. Binary means a sum like this one:

state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...
In our case we have only 2 independent states, `type1` and `type2`. The other two are their combinations: `type3` means neither of them, and `both_types` speaks for itself. So we have only the first two terms of the sum above, where `state[0]` is the logical value for `type1` and `state[1]` is the logical value for `type2`. That’s why I used `codes = type1 + 2*type2`, which is equal to `codes = type1 * 2**0 + type2 * 2**1`.
As for the order in `['type3','type1','type2','both_types']`: in this list the index of each value matches its code, i.e. each type has the index equal to its binary code. In this model the binary code for `type3` is `0b00`, which is zero; for `type1` it is `0b01`, which is one; for `type2` it is `0b10`, which is 2; and for `both_types` it is `0b11`, which equals 3. These codes are matched automatically when creating a `Categorical` sequence with `from_codes`, i.e. pandas uses the codes as indexes to pick the corresponding values from the list and places them instead of the codes.
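
A minimal sketch of that mapping in isolation (toy inputs, independent of `custom_df`):

import pandas as pd

# every combination of the two independent states
type1 = pd.Series([False, True, False, True])
type2 = pd.Series([False, False, True, True])

codes = type1 + 2 * type2  # 0, 1, 2, 3
labels = ['type3', 'type1', 'type2', 'both_types']
print(pd.Categorical.from_codes(codes, labels))
# ['type3', 'type1', 'type2', 'both_types']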
See also `enum.IntFlag` as an abstract realization of this idea, and the flags in the regular expression module (`re`) as an example of how it can be used.
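
For illustration, the same four flags could be modeled with `enum.IntFlag` along these lines (a sketch; the enum and its member names are hypothetical, not part of the answer above):

from enum import IntFlag

class ErrFlag(IntFlag):
    TYPE3 = 0                   # hypothetical names; 0 is the empty/default flag
    TYPE1 = 1                   # bit 0: UNKNOWN err_type with pli_missing == 'yes'
    TYPE2 = 2                   # bit 1: UNKNOWN err_type with pli_missing == 'no'
    BOTH_TYPES = TYPE1 | TYPE2  # 3: both bits set

print(ErrFlag.TYPE1 | ErrFlag.TYPE2)  # ErrFlag.BOTH_TYPES
print(ErrFlag(0))                     # ErrFlag.TYPE3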