Combine Duplicate Rows in a Column in PySpark Dataframe

Question:

I have duplicate rows in a PySpark DataFrame, and I want to combine them into a single row by summing the numeric columns, based on duplicate entries in one column (Deal_ID).

Current Table

Deal_ID Title   Customer    In_Progress Deal_Total
30      Deal 1  Client A    350         900
30      Deal 1  Client A    360         850
50      Deal 2  Client B    30          50
30      Deal 1  Client A    125         200
30      Deal 1  Client A    90          100
10      Deal 3  Client C    32          121

Attempted PySpark Code

F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),

Expected Table

Deal_ID Title   Customer    In_Progress Deal_Total
30      Deal 1  Client A    925         2050
50      Deal 2  Client B    30          50
10      Deal 3  Client C    32          121
Asked By: arnpry


Answers:

  • Since the question has a SQL tag, here is how it would work there:
select
    deal_id,
    title,
    customer,
    sum(in_progress) as in_progress,
    sum(deal_total) as deal_total
from <table_name>
group by 1, 2, 3
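
If you want to run that SQL from PySpark itself, here is a minimal sketch (assuming your DataFrame is called df; the temporary view name deals is an arbitrary choice here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expose the DataFrame to Spark SQL under a temporary view name
df.createOrReplaceTempView('deals')

# group by 1, 2, 3 refers to the first three columns of the select list
result = spark.sql("""
    select deal_id, title, customer,
           sum(in_progress) as in_progress,
           sum(deal_total) as deal_total
    from deals
    group by 1, 2, 3
""")
result.show()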

Otherwise, you can use the equivalent groupBy function in PySpark and apply it to your DataFrame:

  • you have to pass in the columns that you want to group by as a list
  • then you specify the aggregation type for each column you want to add up (a full runnable sketch follows below)
    df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
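
For context, a self-contained sketch using the sample data from the question (assuming a standard SparkSession setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question
data = [
    (30, 'Deal 1', 'Client A', 350, 900),
    (30, 'Deal 1', 'Client A', 360, 850),
    (50, 'Deal 2', 'Client B', 30, 50),
    (30, 'Deal 1', 'Client A', 125, 200),
    (30, 'Deal 1', 'Client A', 90, 100),
    (10, 'Deal 3', 'Client C', 32, 121),
]
df = spark.createDataFrame(data, ['Deal_ID', 'Title', 'Customer', 'In_Progress', 'Deal_Total'])

# Collapse duplicate deals into one row each, summing the numeric columns
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
df.show()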
Answered By: trillion

You need to group by the columns that identify the duplicated rows, then aggregate the amounts. I think this solves your problem:

df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
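
Note that the dictionary form of agg names the output columns sum(In_Progress) and sum(Deal_Total). If you want to keep the original column names, one option is to spell out the aggregations with aliases (a sketch using pyspark.sql.functions):

from pyspark.sql import functions as F

# Explicit aggregations with aliases preserve the original column names
df = df.groupBy('Deal_ID', 'Title', 'Customer').agg(
    F.sum('In_Progress').alias('In_Progress'),
    F.sum('Deal_Total').alias('Deal_Total'),
)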
Answered By: koding_buse