Combine Duplicate Rows in a Column in PySpark Dataframe
Question:
I have duplicate rows in a PySpark DataFrame, and I want to collapse them into a single row per duplicate key (Deal_ID), summing the numeric columns.
Current Table
Deal_ID  Title   Customer  In_Progress  Deal_Total
30       Deal 1  Client A  350          900
30       Deal 1  Client A  360          850
50       Deal 2  Client B  30           50
30       Deal 1  Client A  125          200
30       Deal 1  Client A  90           100
10       Deal 3  Client C  32           121
Attempted PySpark Code
F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),
Expected Table
Deal_ID  Title   Customer  In_Progress  Deal_Total
30       Deal 1  Client A  925          2050
50       Deal 2  Client B  30           50
10       Deal 3  Client C  32           121
Answers:
- You have a SQL tag, so here is how it works in SQL:
select
deal_id,
title,
customer,
sum(in_progress) as in_progress,
sum(deal_total) as deal_total
from <table_name>
group by 1,2,3
Otherwise you can apply the same GROUP BY logic to your PySpark DataFrame:
- you have to pass in the columns that you would need to aggregate by as a list
- then you need to specify the aggregation type and the column you want to add up
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
You need to group by the columns that identify the duplicate rows, then aggregate the amounts. I think this solves your problem:
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})