Adding columns to a DataFrame using different operations for different rows of an existing column
Question:
FOOD_ID SAMPLE_NO ELEMENT1 ELEMENT2 ELEMENT3
F110 A1 0.4 0.2 0.1
F110 A2 0.6 0.1 0.3
F110 B1 0.4 0.3 0.7
F110 B2 0.5 0.6 0.9
F110 C1 0.5 0.3 0.4
F110 C2 0.6 0.2 0.6
F110 C3 0.1 0.1 0.5
F120 B1 0.4 0.2 0.2
F120 B2 0.5 0.2 0.5
F120 B3 0.7 0.3 0.8
F120 B4 0.7 0.7 0.9
F120 B5 0.2 0.9 0.1
My data looks like the table above. I want to add columns that give the average of elements 1, 2 and 3 for each food ID.
If C samples are available, the average is simply the mean of the C samples; if no C sample is present, the "average" is simply the value of the latest B sample.
For food F110, C samples are present, so the result is the average of the C samples.
For food F120, no C sample is present, so the result is the latest B sample, B5.
The dataframe that I finally want looks like below…
FOOD_ID ELEMENT1_AVG ELEMENT2_AVG ELEMENT3_AVG
F110 (0.5+0.6+0.1)/3=0.4 (0.3+0.2+0.1)/3=0.2 (0.4+0.6+0.5)/3=0.5
F120 0.2 0.9 0.1
Need help.
Thanks in advance.
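To make the question reproducible, here is a minimal sketch that loads the table above into a DataFrame. The variable names raw and df are assumptions (df matches the name used in the answers), and reading from a pasted string is just one convenient way to do it.

```python
import io

import pandas as pd

# The table from the question, pasted as whitespace-separated text.
raw = """FOOD_ID SAMPLE_NO ELEMENT1 ELEMENT2 ELEMENT3
F110 A1 0.4 0.2 0.1
F110 A2 0.6 0.1 0.3
F110 B1 0.4 0.3 0.7
F110 B2 0.5 0.6 0.9
F110 C1 0.5 0.3 0.4
F110 C2 0.6 0.2 0.6
F110 C3 0.1 0.1 0.5
F120 B1 0.4 0.2 0.2
F120 B2 0.5 0.2 0.5
F120 B3 0.7 0.3 0.8
F120 B4 0.7 0.7 0.9
F120 B5 0.2 0.9 0.1
"""

# sep=r"\s+" splits each line on any run of whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.shape)  # (12, 5)
```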
Answers:
You can get a mask of the rows with C using Series.str.startswith. For the last B, first test per group whether any C exists, using GroupBy.transform with GroupBy.any, and store the result in m. Invert that mask with ~, combine it with b using &, and filter with df[(~m & b)]. Then mark the last B row per FOOD_ID with DataFrame.duplicated (keep='last'), inverted with ~.
Then merge this mask with c using Series.combine_first, filter the original DataFrame and aggregate with mean. Because the B selection has only one row per group, the mean returns that row's values unchanged:
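# Assumes df is the DataFrame from the question
c = df['SAMPLE_NO'].str.startswith('C')  # rows of C samples
b = df['SAMPLE_NO'].str.startswith('B')  # rows of B samples
# True for every row of a FOOD_ID that has at least one C sample
m = c.groupby(df['FOOD_ID']).transform('any')
# Among B rows of groups without C, mark only the last row per FOOD_ID
mask = ~df[(~m & b)].duplicated(subset=['FOOD_ID'], keep='last')
# Rows missing from mask are filled from c, then averaged per FOOD_ID
df1 = (df[mask.combine_first(c)]
       .groupby('FOOD_ID')
       .mean(numeric_only=True)  # exclude the text column SAMPLE_NO
       .add_suffix('_AVG')
       .reset_index())
print(df1)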
FOOD_ID ELEMENT1_AVG ELEMENT2_AVG ELEMENT3_AVG
0 F110 0.4 0.2 0.5
1 F120 0.2 0.9 0.1
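To see what the masks above select before anything is averaged, the following sketch rebuilds the sample data and prints the surviving rows. The names last_b, keep and result are illustrative, not part of the original answer, and the explicit astype(bool) is a defensive assumption in case combine_first returns an object-typed Series.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "FOOD_ID": ["F110"] * 7 + ["F120"] * 5,
    "SAMPLE_NO": ["A1", "A2", "B1", "B2", "C1", "C2", "C3",
                  "B1", "B2", "B3", "B4", "B5"],
    "ELEMENT1": [0.4, 0.6, 0.4, 0.5, 0.5, 0.6, 0.1, 0.4, 0.5, 0.7, 0.7, 0.2],
    "ELEMENT2": [0.2, 0.1, 0.3, 0.6, 0.3, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.9],
    "ELEMENT3": [0.1, 0.3, 0.7, 0.9, 0.4, 0.6, 0.5, 0.2, 0.5, 0.8, 0.9, 0.1],
})

c = df["SAMPLE_NO"].str.startswith("C")        # C rows
b = df["SAMPLE_NO"].str.startswith("B")        # B rows
m = c.groupby(df["FOOD_ID"]).transform("any")  # group has at least one C

# B rows of groups without any C; True only for the last one per FOOD_ID.
last_b = ~df[~m & b].duplicated(subset=["FOOD_ID"], keep="last")

# last_b covers only part of the index; the remaining rows come from c.
keep = last_b.combine_first(c).astype(bool)

# Rows that feed the aggregation: C1, C2, C3 for F110 and B5 for F120.
print(df.loc[keep, ["FOOD_ID", "SAMPLE_NO"]])

result = (df[keep]
          .groupby("FOOD_ID")[["ELEMENT1", "ELEMENT2", "ELEMENT3"]]
          .mean()
          .add_suffix("_AVG")
          .reset_index())
print(result.round(2))
```

Selecting the element columns explicitly before calling mean is one way to keep the text column SAMPLE_NO out of the aggregation.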
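Another option is to apply a function to each FOOD_ID group:
import pandas as pd

def function1(dd: pd.DataFrame) -> pd.Series:
    # Mean of the C rows if any exist, otherwise the element values of the last B row
    c_rows = dd.query("SAMPLE_NO.str.contains('C')", engine='python')
    if len(c_rows) > 0:
        return c_rows.mean(numeric_only=True)
    return dd.query("SAMPLE_NO.str.contains('B')", engine='python').filter(like='ELEMENT').iloc[-1]

df.groupby('FOOD_ID').apply(function1).rename(columns=lambda ss: f'{ss}_AVG')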
ELEMENT1_AVG ELEMENT2_AVG ELEMENT3_AVG
FOOD_ID
F110 0.4 0.2 0.5
F120 0.2 0.9 0.1
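A variation on the per-group function, sketched under the assumption that selecting the element columns by name is preferable to positional slicing such as iloc[-1, 2:], which depends on which columns the function receives. The names ELEMENT_COLS and c_mean_or_last_b are hypothetical, and the explicit loop over groups is a swapped-in alternative to GroupBy.apply, not the answer's own method.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "FOOD_ID": ["F110"] * 7 + ["F120"] * 5,
    "SAMPLE_NO": ["A1", "A2", "B1", "B2", "C1", "C2", "C3",
                  "B1", "B2", "B3", "B4", "B5"],
    "ELEMENT1": [0.4, 0.6, 0.4, 0.5, 0.5, 0.6, 0.1, 0.4, 0.5, 0.7, 0.7, 0.2],
    "ELEMENT2": [0.2, 0.1, 0.3, 0.6, 0.3, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.9],
    "ELEMENT3": [0.1, 0.3, 0.7, 0.9, 0.4, 0.6, 0.5, 0.2, 0.5, 0.8, 0.9, 0.1],
})

ELEMENT_COLS = ["ELEMENT1", "ELEMENT2", "ELEMENT3"]  # hypothetical helper constant


def c_mean_or_last_b(group: pd.DataFrame) -> pd.Series:
    """Mean of the C rows if any exist, otherwise the last B row."""
    is_c = group["SAMPLE_NO"].str.startswith("C")
    if is_c.any():
        return group.loc[is_c, ELEMENT_COLS].mean()
    is_b = group["SAMPLE_NO"].str.startswith("B")
    # Raises IndexError if the group has neither C nor B rows.
    return group.loc[is_b, ELEMENT_COLS].iloc[-1]


# One Series per FOOD_ID; transpose so that each FOOD_ID becomes a row.
result = (pd.DataFrame({food_id: c_mean_or_last_b(group)
                        for food_id, group in df.groupby("FOOD_ID")})
          .T
          .add_suffix("_AVG")
          .rename_axis("FOOD_ID"))
print(result.round(2))
```

Note that "latest B" here means the last B row in the existing row order, as in both answers; if the rows may arrive unsorted, sorting by SAMPLE_NO first would be needed.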