Fill columns based on groups and conditions in another column

Question:

Consider the below Pandas DataFrame

df = pd.DataFrame({'Make': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes','Jeep','Jeep','Jeep'],
                   'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','3 Series','7 Series','C-Class','C-Class','S-Class','Wrangler','Compass','Patriot'],
                   'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020,2020,2021,2020],
                   'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000,60000,45000,40000],
                   'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black','Grey','Brown','Green'],
                   'Code'  : ['TSLABG','TSLA',None,'TYTA','FRD','_BG',None,None,'BMW','BMW','MercedesBG','Mercedes_BG','MercedesBG',None,'_BG','JeepBG']
                  })
df
     Make   Type       Year Price   Color   Code
0   Tesla   Model X    2015 85000   White   TSLABG
1   Tesla   Model X    2015 90000   White   TSLA
2   Tesla   Model X    2015 95000   White   None
3   Toyota  Corolla    2017 20000   Red     TYTA
4    Ford   Bronco     2018 35000   Blue    FRD 
5    Ford   Bronco     2018 35000   Blue    _BG
6    Ford   Mustang    2020 45000   Yellow  None
7     BMW   3 Series   2015 40000   Silver  None
8     BMW   3 Series   2015 40000   Silver  BMW
9     BMW   7 Series   2017 65000   Black   BMW
10 Mercedes C-Class    2018 50000   White   MercedesBG
11 Mercedes C-Class    2018 50000   White   Mercedes_BG
12 Mercedes S-Class    2020 75000   Black   MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    None
14   Jeep   Compass    2021 45000   Brown   _BG
15   Jeep   Patriot    2020 40000   Green   JeepBG

I am trying to update the Code Column based on the Make. If Code column is having None, it must be correctly filled based on the other values of Code column for the same Make. In other words if any Make is having Code defined in the Code column that value should be used to fill None value in the Code column, also if BG or _BG is appended to any code value of the same Make, all the Code values should be appended with the BG or _BG respectively for the same Make with BG taking precedence over _BG

Since BMW is having no BG or _BG for already existing BMW code values, when replacing the None it doesn’t get appended with BG or _BG. For Ford _BG is present for one of the values and FRD is present for another value so all Code values of Ford should be FRD_BG.

For Mercedes there are Code values with BG appended to the Code, so if any Code value is having _ prepended to the BG, the _ should be removed from the Code Value

Similarly for Jeep I am expecting the output to be JeepBG for all the Jeep Make since if ‘BG’ and ‘_BG’ comes up BG takes priority over _BG

Each make would definitely have a code value specified in one of the entries of the group.

The solution I have tried

code = (df['Code'].str.split('(BG|_BG)', expand=True).add_prefix('part').replace('-', None).groupby(df['Make']).transform('first').fillna('').agg(''.join, axis=1))
df['Code'] = code
df

The output I got has _BG being populated for all the Jeep Code and it should’ve been JeepBG.

     Make   Type       Year Price   Color   Code
0   Tesla   Model X    2015 85000   White   TSLABG
1   Tesla   Model X    2015 90000   White   TSLABG
2   Tesla   Model X    2015 95000   White   TSLABG
3   Toyota  Corolla    2017 20000   Red     TYTA
4    Ford   Bronco     2018 35000   Blue    FRD_BG
5    Ford   Bronco     2018 35000   Blue    FRD_BG
6    Ford   Mustang    2020 45000   Yellow  FRD_BG
7     BMW   3 Series   2015 40000   Silver  BMW
8     BMW   3 Series   2015 40000   Silver  BMW
9     BMW   7 Series   2017 65000   Black   BMW
10 Mercedes C-Class    2018 50000   White   MercedesBG
11 Mercedes C-Class    2018 50000   White   MercedesBG
12 Mercedes S-Class    2020 75000   Black   MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    _BG
14   Jeep   Compass    2021 45000   Brown   _BG
15   Jeep   Patriot    2020 40000   Green   _BG

The expected output is:

     Make   Type       Year Price   Color   Code
0   Tesla   Model X    2015 85000   White   TSLABG
1   Tesla   Model X    2015 90000   White   TSLABG
2   Tesla   Model X    2015 95000   White   TSLABG
3   Toyota  Corolla    2017 20000   Red     TYTA
4    Ford   Bronco     2018 35000   Blue    FRD_BG
5    Ford   Bronco     2018 35000   Blue    FRD_BG
6    Ford   Mustang    2020 45000   Yellow  FRD_BG
7     BMW   3 Series   2015 40000   Silver  BMW
8     BMW   3 Series   2015 40000   Silver  BMW
9     BMW   7 Series   2017 65000   Black   BMW
10 Mercedes C-Class    2018 50000   White   MercedesBG
11 Mercedes C-Class    2018 50000   White   MercedesBG
12 Mercedes S-Class    2020 75000   Black   MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    JeepBG
14   Jeep   Compass    2021 45000   Brown   JeepBG
15   Jeep   Patriot    2020 40000   Green   JeepBG
Asked By: Sudeep George

||

Answers:

Let’s do this in steps, since there are many requirements

First, let’s define columns is_bg and isbg, which will aid us throughout the process:

grouper = df.groupby('Make')['Code']
df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
df['isbg']  = grouper.transform(lambda s: s.str.endswith('BG') 
                                   & ~s.str.endswith('_BG').astype(bool))

Now, let’s remove _BG and BG suffixes so that we can find the unique identifier for each Code:

df.loc[df.isbg,  'Code'] = df.loc[df.isbg,  'Code'].str[:-2]
df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]

Now, we create a mapper and assign:

mapper = df.groupby('Make').apply(lambda s:
                           # Retrieve the unique code
                           s.loc[s['Code'].str.len() > 1,'Code'].iloc[0] + 
                           # Append BG if `isbg`
                           ('BG' if s.isbg.any() else 
                           # Otherwise, append _BG if `is_bg`
                           ('_BG' if s.is_bg.any() else '')))
df['New_Code'] = df['Make'].map(mapper)

Notice that there’s an assumption here that for each Make, we’ll only have one entry in Code that, after having _BG and BG removed, will NOT be either whitespace or None. If this is not true, then the problem is ambiguous and you’ll need to explore how to find the unique Code identifier.


        Make      Type  Year  Price      Code   isbg  is_bg    New_Code
0      Tesla   Model X  2015  85000      TSLA   True  False      TSLABG
1      Tesla   Model X  2015  90000      TSLA  False  False      TSLABG
2      Tesla   Model X  2015  95000      None  False  False      TSLABG
3     Toyota   Corolla  2017  20000      TYTA  False  False        TYTA
4       Ford    Bronco  2018  35000       FRD  False  False      FRD_BG
5       Ford    Bronco  2018  35000            False   True      FRD_BG
6       Ford   Mustang  2020  45000      None  False  False      FRD_BG
7        BMW  3 Series  2015  40000      None  False  False         BMW
8        BMW  3 Series  2015  40000       BMW  False  False         BMW
9        BMW  7 Series  2017  65000       BMW  False  False         BMW
10  Mercedes   C-Class  2018  50000  Mercedes   True  False  MercedesBG
11  Mercedes   C-Class  2018  50000  Mercedes  False   True  MercedesBG
12  Mercedes   S-Class  2020  75000  Mercedes   True  False  MercedesBG
13      Jeep  Wrangler  2020  60000      None  False  False      JeepBG
14      Jeep   Compass  2021  45000            False   True      JeepBG
15      Jeep   Patriot  2020  40000      Jeep   True  False      JeepBG
Answered By: rafaelc

remove _BG and BG / fill value except BG

s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

s

0         TSLA
1         TSLA
2         TSLA
3         TYTA
4          FRD
5          FRD
6          FRD
7          BMW
8          BMW
9          BMW
10    Mercedes
11    Mercedes
12    Mercedes
13        Jeep
14        Jeep
15        Jeep

extract BG and _BG and so on..

df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()

df1

    0   1
0   BG  NaN
1   BG  NaN
2   BG  NaN
3   NaN NaN
4   NaN _BG
5   NaN _BG
6   NaN _BG
7   NaN NaN
8   NaN NaN
9   NaN NaN
10  BG  _BG
11  BG  _BG
12  BG  _BG
13  BG  _BG
14  BG  _BG
15  BG  _BG

fillna df1 and string concat

df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))

out

    Make        Type        Year    Price   Color   Code
0   Tesla       Model X     2015    85000   White   TSLABG
1   Tesla       Model X     2015    90000   White   TSLABG
2   Tesla       Model X     2015    95000   White   TSLABG
3   Toyota      Corolla     2017    20000   Red     TYTA
4   Ford        Bronco      2018    35000   Blue    FRD_BG
5   Ford        Bronco      2018    35000   Blue    FRD_BG
6   Ford        Mustang     2020    45000   Yellow  FRD_BG
7   BMW         3 Series    2015    40000   Silver  BMW
8   BMW         3 Series    2015    40000   Silver  BMW
9   BMW         7 Series    2017    65000   Black   BMW
10  Mercedes    C-Class     2018    50000   White   MercedesBG
11  Mercedes    C-Class     2018    50000   White   MercedesBG
12  Mercedes    S-Class     2020    75000   Black   MercedesBG
13  Jeep        Wrangler    2020    60000   Grey    JeepBG
14  Jeep        Compass     2021    45000   Brown   JeepBG
15  Jeep        Patriot     2020    40000   Green   JeepBG
Answered By: Panda Kim

Building off of Panda Kim’s solution

lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)',expand =False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()

rhs = rhs[0].fillna((rhs[1]).fillna(''))

df['Code'] = lhs + rhs
Answered By: Sudeep George