Combine Duplicate Rows in a Column in PySpark Dataframe

Question:

I have duplicate rows in a PySpark DataFrame, and I want to combine them into a single row by summing the numeric columns, based on duplicate entries in one column (Deal_ID).

Current Table

Deal_ID Title   Customer    In_Progress Deal_Total
30      Deal 1  Client A    350         900
30      Deal 1  Client A    360         850
50      Deal 2  Client B    30          50
30      Deal 1  Client A    125         200
30      Deal 1  Client A    90          100
10      Deal 3  Client C    32          121

Attempted PySpark Code

F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),

Expected Table

Deal_ID Title   Customer    In_Progress Deal_Total
30      Deal 1  Client A    925         2050
50      Deal 2  Client B    30          50
10      Deal 3  Client C    32          121
Asked By: arnpry


Answers:

  • Since the question has a SQL tag, here is how it would work there:
select
    deal_id,
    title,
    customer,
    sum(in_progress) as in_progress,
    sum(deal_total) as deal_total
from <table_name>
group by 1, 2, 3
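
If you want to run that SQL from PySpark itself, here is a minimal sketch (assuming your DataFrame is called df; the temporary view name deals is an arbitrary choice here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expose the DataFrame to Spark SQL under a temporary view name
df.createOrReplaceTempView('deals')

# group by 1, 2, 3 refers to the first three columns of the select list
result = spark.sql("""
    select deal_id, title, customer,
           sum(in_progress) as in_progress,
           sum(deal_total) as deal_total
    from deals
    group by 1, 2, 3
""")
result.show()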

Otherwise, you can use the equivalent groupBy function in PySpark and apply it to your DataFrame:

  • you have to pass in the columns that you want to group by as a list
  • then you specify the aggregation type for each column you want to add up (a full runnable sketch follows below)
    df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
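
For context, a self-contained sketch using the sample data from the question (assuming a standard SparkSession setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question
data = [
    (30, 'Deal 1', 'Client A', 350, 900),
    (30, 'Deal 1', 'Client A', 360, 850),
    (50, 'Deal 2', 'Client B', 30, 50),
    (30, 'Deal 1', 'Client A', 125, 200),
    (30, 'Deal 1', 'Client A', 90, 100),
    (10, 'Deal 3', 'Client C', 32, 121),
]
df = spark.createDataFrame(data, ['Deal_ID', 'Title', 'Customer', 'In_Progress', 'Deal_Total'])

# Collapse duplicate deals into one row each, summing the numeric columns
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
df.show()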
Answered By: trillion

You need to group by the columns that identify the duplicated rows, then aggregate the amounts. I think this solves your problem:

df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
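
Note that the dictionary form of agg names the output columns sum(In_Progress) and sum(Deal_Total). If you want to keep the original column names, one option is to spell out the aggregations with aliases (a sketch using pyspark.sql.functions):

from pyspark.sql import functions as F

# Explicit aggregations with aliases preserve the original column names
df = df.groupBy('Deal_ID', 'Title', 'Customer').agg(
    F.sum('In_Progress').alias('In_Progress'),
    F.sum('Deal_Total').alias('Deal_Total'),
)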
Answered By: koding_buse