How to aggregate only non duplicates values using Pandas

Question:

I have the dataframe below:

    ID  COL1    COL2
0   id001   val1    xxxxx
1   id001   val1    yyyyy
2   id002   val2    yyyyy
3   id003   val3    zzzzz
4   id003   val4    zzzzz

And this is the expected output:

ID  COL1    COL2
0   id001   val1    xxxxx|yyyyy
1   id002   val2    yyyyy
2   id003   val3|val4   zzzzz

I made the code below but unfortunately val1 (first row) and zzzzz (last row) are repeated.

df_gr = df[['COL1', 'COL2']].astype(str).groupby(df['ID']).agg('|'.join).reset_index()

    ID  COL1    COL2
0   id001   val1|val1   xxxxx|yyyyy
1   id002   val2    yyyyy
2   id003   val3|val4   zzzzz|zzzzz

Do you know how to fix this, please ?

Asked By: M92_

||

Answers:

Try this

df[['COL1', 'COL2']].astype(str).groupby(['ID','COL1']).agg('|'.join).reset_index()

This will limit you groupby and formulate a single series, after that you just need to append the series into your df

Answered By: INGl0R1AM0R1

Note that I needed to convert an array to a list with string elements before applying join:

df.groupby(['ID']).agg({'COL1': 'unique', 'COL2':'unique'}).applymap(lambda x: '|'.join(map(str, x))).reset_index()
Answered By: Nuri Taş
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.