Pandas update row of a group matching multiple conditions

Question:

New to pandas and first question on Stack Overflow (bear with me): I have a df of individuals, sometimes grouped under a family ID. The data of interest here are Gender and Status within the family, as follows:

RowID  FamilyID  Status  Gender
1      Fam_1     head    undetermined
2      Fam_1     wife    female
3      Fam_1     child   undetermined
4      Fam_1     child   male
5                head    male
6      Fam_2     head    female
7      Fam_2     child   female
8      Fam_3     head    undetermined
9      Fam_3     wife    female
10     Fam_3     child   male
11     Fam_3     head    undetermined

Note: see row 5: some individuals are single (no FamilyID); see Fam_3: some families have several heads (related adults).

To start, I need to create a new column Gender_Inferred where Gender_Inferred = male only for heads of undetermined gender (row 1) in families that have a wife (in Status) and only one head (Fam_3 is excluded because of row 11).

I am able to create a mask for families with a wife as follows :

mask1 = df.groupby('FamilyID')['Status'].transform(lambda r: r.eq('wife').any())

a mask for the combined criteria head/undetermined to update:

mask2 = (df['Status'] == 'head') & (df['Gender'] == 'undetermined')

I am then applying conditions with :

df['Gender_Inferred'] = np.nan

df['Gender_Inferred'] = np.where(mask1 & mask2, 'male', df['Gender'])

But I have not been able to create a mask3 for the condition ‘Family has only 1 Status=head and Gender=undetermined’. It is ‘almost’ as if one wanted to do

df.groupby('FamilyID')[['Status','Gender']].transform(lambda r: (r[0].eq('head') & r[1].eq('undetermined')).count() == 1)

but of course this isn’t proper code.

I would need to have:

RowID  FamilyID  Status  Gender        Gender_Inferred
1      Fam_1     head    undetermined  male
2      Fam_1     wife    female        female
3      Fam_1     child   undetermined  undetermined
4      Fam_1     child   male          male
5                head    undetermined  undetermined
6      Fam_2     head    female        female
7      Fam_2     child   female        female
8      Fam_3     head    undetermined  undetermined
9      Fam_3     wife    female        female
10     Fam_3     child   male          male
11     Fam_3     head    undetermined  undetermined

Masking with groupby or updating with np.where (which quite often raises length-mismatch errors) is not mandatory; I would be happy with any working solution.

Thank you

Asked By: jvb


Answers:

Sample input

import numpy as np
import pandas as pd

# Row 5 has no FamilyID (a single individual); Fam_3 has two heads.
df = pd.DataFrame([
    [1,  "Fam_1", "head",  "undetermined"],
    [2,  "Fam_1", "wife",  "female"],
    [3,  "Fam_1", "child", "undetermined"],
    [4,  "Fam_1", "child", "male"],
    [5,  np.nan,  "head",  "male"],
    [6,  "Fam_2", "head",  "female"],
    [7,  "Fam_2", "child", "female"],
    [8,  "Fam_3", "head",  "undetermined"],
    [9,  "Fam_3", "wife",  "female"],
    [10, "Fam_3", "child", "male"],
    [11, "Fam_3", "head",  "undetermined"],
], columns=["RowID", "FamilyID", "Status", "Gender"])

Marking missing FamilyID values (NaN) as Single

df["FamilyID"] = df["FamilyID"].fillna("Single")

Calculating number of heads in the family

heads_df = df.loc[df.Status == "head"].groupby("FamilyID")["Status"].count().reset_index(name="HeadCount")

Merging the information back to original df

df = df.merge(heads_df, on="FamilyID", how="left")

Adding a new column using shift

df["NextMember"] = df.Status.shift(-1)
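One thing to keep in mind with a plain shift is that it looks at the next row of the whole frame, so the last member of one family sees the first member of the next. As a minimal sketch (the demo frame and the NextInFamily column name are made up for illustration), a grouped shift keeps the look-ahead inside each family:

```python
import pandas as pd

# Hypothetical mini-frame: the last row of Fam_A is a head and the
# first row of Fam_B is a wife, so a plain shift pairs them up.
demo = pd.DataFrame({
    "FamilyID": ["Fam_A", "Fam_A", "Fam_B", "Fam_B"],
    "Status":   ["child", "head",  "wife",  "head"],
})

# Plain shift: the Fam_A head "sees" the Fam_B wife on the next row.
demo["NextMember"] = demo["Status"].shift(-1)

# Grouped shift: the look-ahead stops at the family boundary, so the
# last row of each family gets NaN instead of a neighbour's status.
demo["NextInFamily"] = demo.groupby("FamilyID")["Status"].shift(-1)

print(demo)
```

With the sample data in this answer the two variants should select the same row, since no head sits directly above another family's wife; the grouped form only matters when that can happen.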

With all the information in place, run the query and assign

df.loc[
    (df.FamilyID != "Single")
    & (df.Status == "head")
    & (df.NextMember == "wife")
    & (df.Gender == "undetermined")
    & (df.HeadCount == 1)
    , "Gender"] = "male"

Drop newly created columns

df.drop(["HeadCount", "NextMember"], inplace=True, axis=1)

output

    RowID   FamilyID    Status  Gender
0   1   Fam_1   head    male
1   2   Fam_1   wife    female
2   3   Fam_1   child   undetermined
3   4   Fam_1   child   male
4   5   Single  head    male
5   6   Fam_2   head    female
6   7   Fam_2   child   female
7   8   Fam_3   head    undetermined
8   9   Fam_3   wife    female
9   10  Fam_3   child   male
10  11  Fam_3   head    undetermined

Note: From the sample input given above, I assumed that a row with Status == head is immediately followed by the row with Status == wife. If that assumption is wrong, let me know: the solution will not work in that case.
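If the wife row cannot be assumed to follow the head row directly, the same conditions can be expressed per family without relying on row order. The following is a minimal sketch rather than the method used above: it swaps shift for groupby(...).transform, writes the result to a new Gender_Inferred column as the question asks, and the helper names (family, has_wife, one_head, target) are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1,  "Fam_1", "head",  "undetermined"],
    [2,  "Fam_1", "wife",  "female"],
    [3,  "Fam_1", "child", "undetermined"],
    [4,  "Fam_1", "child", "male"],
    [5,  np.nan,  "head",  "male"],
    [6,  "Fam_2", "head",  "female"],
    [7,  "Fam_2", "child", "female"],
    [8,  "Fam_3", "head",  "undetermined"],
    [9,  "Fam_3", "wife",  "female"],
    [10, "Fam_3", "child", "male"],
    [11, "Fam_3", "head",  "undetermined"],
], columns=["RowID", "FamilyID", "Status", "Gender"])

is_head = df["Status"].eq("head")
is_wife = df["Status"].eq("wife")

# Rows without a FamilyID get a placeholder key so that groupby keeps
# them; they are excluded again below via notna().
family = df["FamilyID"].fillna("Single")

# Per-row flags broadcast from the family level, independent of row order.
has_wife = is_wife.groupby(family).transform("any")
one_head = is_head.groupby(family).transform("sum").eq(1)

target = (
    df["FamilyID"].notna()
    & is_head
    & df["Gender"].eq("undetermined")
    & has_wife
    & one_head
)

# Copy Gender and overwrite only the targeted rows.
df["Gender_Inferred"] = df["Gender"].mask(target, "male")

print(df)
```

Because every mask is built on the frame's own index, all of them have the same length as df, which sidesteps the length-mismatch errors that can appear when combining a grouped result with np.where.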

Answered By: srinath