Adding a column to a dataframe using a different operation for different rows of an existing column

Question

FOOD_ID   SAMPLE_NO   ELEMENT1   ELEMENT2   ELEMENT3
 F110       A1          0.4        0.2        0.1
 F110       A2          0.6        0.1        0.3
 F110       B1          0.4        0.3        0.7
 F110       B2          0.5        0.6        0.9
 F110       C1          0.5        0.3        0.4
 F110       C2          0.6        0.2        0.6
 F110       C3          0.1        0.1        0.5
 F120       B1          0.4        0.2        0.2
 F120       B2          0.5        0.2        0.5
 F120       B3          0.7        0.3        0.8
 F120       B4          0.7        0.7        0.9
 F120       B5          0.2        0.9        0.1

My data looks like the above. I want to add columns that give the average of elements 1, 2 and 3 for each food ID.
If a C sample is available, the average is simply the average of the C samples; if no C sample is present, the "average" is simply the latest B sample.

For food F110 a C sample is present, so the average is the average of the C samples.
For food F120 no C sample is present, so the average is the latest B sample, B5.

The dataframe I finally want looks like this:

FOOD_ID   ELEMENT1_AVG           ELEMENT2_AVG           ELEMENT3_AVG
F110      (0.5+0.6+0.1)/3=0.4    (0.3+0.2+0.1)/3=0.2    (0.4+0.6+0.5)/3=0.5
F120      0.2                    0.9                    0.1

Need help.
Thanks in advance.
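
For anyone reproducing this, here is a minimal sketch that loads the sample table above into pandas. The variable name df is an assumption, chosen because the answers below operate on df. The row order is taken to be the sampling order, so "latest" means the last row.

```python
import io

import pandas as pd

# Sample data copied from the question; whitespace-separated columns.
raw = """FOOD_ID SAMPLE_NO ELEMENT1 ELEMENT2 ELEMENT3
F110 A1 0.4 0.2 0.1
F110 A2 0.6 0.1 0.3
F110 B1 0.4 0.3 0.7
F110 B2 0.5 0.6 0.9
F110 C1 0.5 0.3 0.4
F110 C2 0.6 0.2 0.6
F110 C3 0.1 0.1 0.5
F120 B1 0.4 0.2 0.2
F120 B2 0.5 0.2 0.5
F120 B3 0.7 0.3 0.8
F120 B4 0.7 0.7 0.9
F120 B5 0.2 0.9 0.1
"""

# sep=r"\s+" splits on any run of whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```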

Asked By: Temp_coder

Answer 1

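Build a mask c of the C rows with Series.str.startswith and, in the same way, a mask b of the B rows. To find the foods that have no C sample, test per group whether any C row exists with GroupBy.transform('any'), giving m, and invert it with ~. Chained with b, df[(~m & b)] keeps only the B rows of foods without a C sample; DataFrame.duplicated with keep='last', inverted by ~, then marks just the last B row of each such food.

Next, fill the rows missing from that mask with c using Series.combine_first, filter the original DataFrame with the result and aggregate with mean. Because a food without a C sample contributes only one B row, the mean returns that row's values unchanged: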

c = df['SAMPLE_NO'].str.startswith('C')
b = df['SAMPLE_NO'].str.startswith('B')
m = c.groupby(df['FOOD_ID']).transform('any')

mask = ~df[(~m & b)].duplicated(subset=['FOOD_ID'], keep='last')
df1 = (df[mask.combine_first(c)]
              .groupby('FOOD_ID')
              .mean(numeric_only=True)  # skip the text column SAMPLE_NO
              .add_suffix('_AVG')
              .reset_index())
print(df1)
  FOOD_ID  ELEMENT1_AVG  ELEMENT2_AVG  ELEMENT3_AVG
0    F110           0.4           0.2           0.5
1    F120           0.2           0.9           0.1
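
The masking steps above can also be spelled out one mask at a time. This is a simplified variant of the same idea, not the answerer's exact code: the names is_c, is_b, has_c, is_last_b and keep are illustrative, and the row order is assumed to be the sampling order, so that the last B row is the latest one.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "FOOD_ID": ["F110"] * 7 + ["F120"] * 5,
    "SAMPLE_NO": ["A1", "A2", "B1", "B2", "C1", "C2", "C3",
                  "B1", "B2", "B3", "B4", "B5"],
    "ELEMENT1": [0.4, 0.6, 0.4, 0.5, 0.5, 0.6, 0.1, 0.4, 0.5, 0.7, 0.7, 0.2],
    "ELEMENT2": [0.2, 0.1, 0.3, 0.6, 0.3, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.9],
    "ELEMENT3": [0.1, 0.3, 0.7, 0.9, 0.4, 0.6, 0.5, 0.2, 0.5, 0.8, 0.9, 0.1],
})
element_cols = ["ELEMENT1", "ELEMENT2", "ELEMENT3"]

# Row-level masks for C and B samples.
is_c = df["SAMPLE_NO"].str.startswith("C")
is_b = df["SAMPLE_NO"].str.startswith("B")

# True on every row of a food that has at least one C sample.
has_c = is_c.groupby(df["FOOD_ID"]).transform("any")

# True only on the last B row of each food.
last_b_index = df[is_b].groupby("FOOD_ID").tail(1).index
is_last_b = pd.Series(df.index.isin(last_b_index), index=df.index)

# Keep all C rows where C exists, otherwise the single last B row.
keep = (has_c & is_c) | (~has_c & is_last_b)

result = (df[keep]
          .groupby("FOOD_ID")[element_cols]
          .mean()
          .add_suffix("_AVG")
          .reset_index())
print(result.round(2))
```

Selecting the element columns by name before calling mean avoids depending on how a given pandas version treats the text column SAMPLE_NO.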

Answered By: jezrael

Answer 2

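import pandas as pd

def function1(dd: pd.DataFrame):
    # Keep only the numeric ELEMENT columns of this food's rows
    elements = dd.filter(like='ELEMENT')
    is_c = dd['SAMPLE_NO'].str.contains('C')
    if is_c.any():
        return elements[is_c].mean()  # average of the C samples
    # No C sample: fall back to the last (latest) B row
    return elements[dd['SAMPLE_NO'].str.contains('B')].iloc[-1]

# df is the DataFrame from the question
df.groupby('FOOD_ID').apply(function1).rename(columns=lambda ss: f'{ss}_AVG')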

       ELEMENT1_AVG  ELEMENT2_AVG  ELEMENT3_AVG
FOOD_ID                                          
F110              0.4           0.2           0.5
F120              0.2           0.9           0.1

Answered By: G.G
