Pandas update row of a group matching multiple conditions
Question:
I'm new to pandas and this is my first question on Stack Overflow (bear with me): I have a df of individuals, sometimes grouped under a family ID. The data of interest here is Gender and Status within the family, as follows:
RowID | FamilyID | Status | Gender |
---|---|---|---|
1 | Fam_1 | head | undetermined |
2 | Fam_1 | wife | female |
3 | Fam_1 | child | undetermined |
4 | Fam_1 | child | male |
5 | | head | male |
6 | Fam_2 | head | female |
7 | Fam_2 | child | female |
8 | Fam_3 | head | undetermined |
9 | Fam_3 | wife | female |
10 | Fam_3 | child | male |
11 | Fam_3 | head | undetermined |
Note: some individuals are single (no FamilyID; see row 5), and some families have several heads (related adults; see Fam_3).
Initially, I need to create a new column Gender_Inferred where Gender_Inferred = male only for heads of undetermined gender (row 1) in families that have a wife (in Status) and only one head (Fam_3 is excluded because of row 11).
I am able to create a mask for families with a wife as follows :
mask1 = df.groupby('FamilyID')['Status'].transform(lambda r: r.eq('wife').any())
and a mask for the combined head/undetermined criteria to update:
mask2 = (df['Status'] == 'head') & (df['Gender'] == 'undetermined')
I am then applying conditions with :
df['Gender_Inferred'] = np.nan
df['Gender_Inferred'] = np.where(mask1 & mask2, 'male', df['Gender'])
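For reference, a minimal sketch of what the family-level mask does: `transform` computes one value per family and repeats it on every row of that family, so the result has the same length as df. By default `groupby` leaves out rows whose key is NaN, which may be one source of the length-mismatch errors mentioned further down. The sketch sidesteps that with a placeholder key for singles; the `key` variable and the `single_` prefix are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "RowID": [1, 2, 3, 4, 5, 6, 7],
        "FamilyID": ["Fam_1", "Fam_1", "Fam_1", "Fam_1", np.nan, "Fam_2", "Fam_2"],
        "Status": ["head", "wife", "child", "child", "head", "head", "child"],
    }
)

# Hypothetical placeholder key: each single becomes a one-person "family",
# so no row is left out of the grouping because of a NaN key.
key = df["FamilyID"].fillna("single_" + df["RowID"].astype(str))

# Test each row first, then broadcast "some row in my family is a wife"
# back to every row of that family.
mask1 = df["Status"].eq("wife").groupby(key).transform("any")

print(mask1.tolist())
# [True, True, True, True, False, False, False]
```

Building the boolean Series before grouping also avoids the lambda, which tends to be slower on large frames.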
But I have not been able to create a mask3 for the condition ‘Family has only 1 Status=head and Gender=undetermined’. It is ‘almost’ as if one wanted to do
df.groupby('FamilyID')[['Status','Gender']].transform(lambda r: (r[0].eq('head') & r[1].eq('undetermined')).count() == 1)
but of course this isn’t proper code.
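One possible way to express the missing mask3, sketched under the assumption that "only one head" means exactly one row with Status == head per family: build a boolean Series for "is a head", sum it per family with `transform`, and compare to 1. The placeholder `key` for singles is again a hypothetical helper, and the sample values are taken from the first table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1, "Fam_1", "head", "undetermined"],
        [2, "Fam_1", "wife", "female"],
        [3, "Fam_1", "child", "undetermined"],
        [4, "Fam_1", "child", "male"],
        [5, np.nan, "head", "male"],
        [6, "Fam_2", "head", "female"],
        [7, "Fam_2", "child", "female"],
        [8, "Fam_3", "head", "undetermined"],
        [9, "Fam_3", "wife", "female"],
        [10, "Fam_3", "child", "male"],
        [11, "Fam_3", "head", "undetermined"],
    ],
    columns=["RowID", "FamilyID", "Status", "Gender"],
)

# Hypothetical placeholder key so singles form their own one-person group
key = df["FamilyID"].fillna("single_" + df["RowID"].astype(str))

is_head = df["Status"].eq("head")

# mask1: the row's family contains a wife
mask1 = df["Status"].eq("wife").groupby(key).transform("any")
# mask2: the row itself is a head of undetermined gender
mask2 = is_head & df["Gender"].eq("undetermined")
# mask3: the row's family has exactly one head
mask3 = is_head.astype(int).groupby(key).transform("sum").eq(1)

df["Gender_Inferred"] = np.where(mask1 & mask2 & mask3, "male", df["Gender"])
print(df)
```

To count only undetermined heads instead, as in the attempted lambda, `mask2` can be summed per family in place of `is_head`. All three masks share df's index, so combining them with `&` cannot produce a length mismatch.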
I would need to have:
RowID | FamilyID | Status | Gender | Gender_Inferred |
---|---|---|---|---|
1 | Fam_1 | head | undetermined | male |
2 | Fam_1 | wife | female | female |
3 | Fam_1 | child | undetermined | undetermined |
4 | Fam_1 | child | male | male |
5 | | head | undetermined | undetermined |
6 | Fam_2 | head | female | female |
7 | Fam_2 | child | female | female |
8 | Fam_3 | head | undetermined | undetermined |
9 | Fam_3 | wife | female | female |
10 | Fam_3 | child | male | male |
11 | Fam_3 | head | undetermined | undetermined |
Masking with groupby or updating with np.where (which quite often causes length-mismatch errors) is not mandatory; I would be happy with any working solution.
Thank you
Answers:
Sample input
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, "Fam_1", "head", "undetermined"],
    [2, "Fam_1", "wife", "female"],
    [3, "Fam_1", "child", "undetermined"],
    [4, "Fam_1", "child", "male"],
    [5, np.nan, "head", "male"],  # single: no FamilyID
    [6, "Fam_2", "head", "female"],
    [7, "Fam_2", "child", "female"],
    [8, "Fam_3", "head", "undetermined"],
    [9, "Fam_3", "wife", "female"],
    [10, "Fam_3", "child", "male"],
    [11, "Fam_3", "head", "undetermined"],
], columns=["RowID", "FamilyID", "Status", "Gender"])
Marking missing FamilyID values (NaN) as Single
df["FamilyID"] = df["FamilyID"].fillna("Single")
Calculating number of heads in the family
heads_df = df.loc[df.Status == "head"].groupby("FamilyID")["Status"].count().reset_index(name="HeadCount")
Merging the information back to original df
df = df.merge(heads_df, on="FamilyID", how="left")
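As a side note, the count-and-merge step can also be written without a merge. This is only a sketch of an alternative, not the approach used in this answer: count the head rows per family with `value_counts`, then `map` those counts back onto the FamilyID column. The small frame and the `head_counts` name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "FamilyID": ["Fam_1", "Fam_1", "Fam_3", "Fam_3", "Fam_3"],
        "Status": ["head", "wife", "head", "wife", "head"],
    }
)

# Series indexed by FamilyID holding the number of head rows per family
head_counts = df.loc[df["Status"].eq("head"), "FamilyID"].value_counts()

# Look up each row's family; families without any head would get 0
df["HeadCount"] = df["FamilyID"].map(head_counts).fillna(0).astype(int)

print(df["HeadCount"].tolist())
# [1, 1, 2, 2, 2]
```

This keeps the original row order and index, whereas a merge builds a new frame.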
Adding a new column using shift
df["NextMember"] = df.Status.shift(-1)
With all the information in place, run the query and assign
df.loc[
    (df.FamilyID != "Single")
    & (df.Status == "head")
    & (df.NextMember == "wife")
    & (df.Gender == "undetermined")
    & (df.HeadCount == 1),
    "Gender",
] = "male"
Drop newly created columns
df.drop(["HeadCount", "NextMember"], inplace=True, axis=1)
output
RowID FamilyID Status Gender
0 1 Fam_1 head male
1 2 Fam_1 wife female
2 3 Fam_1 child undetermined
3 4 Fam_1 child male
4 5 Single head male
5 6 Fam_2 head female
6 7 Fam_2 child female
7 8 Fam_3 head undetermined
8 9 Fam_3 wife female
9 10 Fam_3 child male
10 11 Fam_3 head undetermined
Note: From the sample input given above, I assumed that a row with Status == head is immediately followed by the row with Status == wife. If that assumption is wrong, let me know; the solution will not work in that case.
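To illustrate the assumption in the note, here is a small sketch comparing the adjacency check with a per-family check on a made-up family where the wife is not listed directly after the head. The frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical family: the wife is listed last, not right after the head
df = pd.DataFrame(
    {
        "FamilyID": ["Fam_1", "Fam_1", "Fam_1"],
        "Status": ["head", "child", "wife"],
        "Gender": ["undetermined", "male", "female"],
    }
)

# Adjacency-based check: only looks at the row immediately below
next_is_wife = df["Status"].shift(-1).eq("wife")

# Group-based check: True on every row of a family that contains a wife
family_has_wife = df["Status"].eq("wife").groupby(df["FamilyID"]).transform("any")

print(next_is_wife.tolist())     # [False, True, False]
print(family_has_wife.tolist())  # [True, True, True]
```

With this ordering the head row fails the adjacency check but passes the group-based one, so a per-family flag is the safer choice when row order is not guaranteed.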