Create a new value output if the values of another column do not match

Question:

I am working with some complicated patient health data and would like to simplify categorizing the types of insurance patients use. However, some patients use multiple insurance types to pay for a medical visit (ie Medicare and supplemental insurance). I have already categorized insurance as either private or public and would now like to categorize if patients are using more than one insurance type.

My current data set looks like below:

import pandas as pd
data = {'patient_ID': [1,1,1,2,2,3,3,3,4,4,4,4],
   'Public_Private': ['Public', 'Public', 'Public',
                      'Private', 'Public','Private', 'Private', 
                      'Private', 'Private', 'Public', 'Private', 
                      'Public']}
df = pd.DataFrame(data)
df
patient_ID Public_Private
1 Public
1 Private
2 Private
2 Private
2 Private

If the types of insurance differ for a given patient ID, in a new column, I want to return "Mixed". If the types of insurance are the same, I want to return the original value. For example,

patient_ID Public_Private Simple
1 Public Mixed
1 Private Mixed
2 Private Private
2 Private Private
2 Private Private

As you can see, the number of rows for each patient_ID will vary. The code will need to be flexible to account for this.

Asked By: c_hornung

||

Answers:

You can group by the patient id, select the public_private column from the groups, and then apply a transformation, which you can populate back to the original rows of the groups in a new column.

import pandas as pd


data = {
    'patient_ID': [1, 1, 2, 2, 3, 3],
    'Public_Private': ['Public', 'Private',
                       'Private', 'Private',
                       'Public', 'Public'],
    'Other_info': [1.1, 1.2,
                   2.1, 2.2,
                   3.1, 3.2]
}


df = pd.DataFrame(data)

print(df)
# patient_ID Public_Private  Other_info
#          1         Public         1.1
#          1        Private         1.2
#          2        Private         2.1
#          2        Private         2.2
#          3         Public         3.1
#          3         Public         3.2



def simple(g_arr: pd.Series):
    #  ex. g_arr = ['Private', 'Public'] or ['Public', 'Public'], etc.
    if len(set(g_arr)) == 1:
        return g_arr.iloc[0]
    return 'Mixed'


df['Simple'] = df.groupby('patient_ID')['Public_Private'].transform(simple)

print(df)
# patient_ID Public_Private  Other_info   Simple
#          1         Public         1.1    Mixed
#          1        Private         1.2    Mixed
#          2        Private         2.1  Private
#          2        Private         2.2  Private
#          3         Public         3.1   Public
#          3         Public         3.2   Public
Answered By: crunker99