Concatenating string by rows in pyspark

Question:

I am having a pyspark dataframe as

DOCTOR | PATIENT
JOHN   | SAM
JOHN   | PETER
JOHN   | ROBIN
BEN    | ROSE
BEN    | GRAY

and need to concatenate patient names by rows so that I get the output like:

DOCTOR | PATIENT
JOHN   | SAM, PETER, ROBIN
BEN    | ROSE, GRAY

Can anybody help me regarding creating this dataframe in pyspark ?

Thanks in advance.

Asked By: Prerit Saxena

||

Answers:

The simplest way I can think of is to use collect_list

import pyspark.sql.functions as f
df.groupby("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)))
Answered By: Assaf Mendelson
import pyspark.sql.functions as f
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

data = [
  ("U_104", "food"),
  ("U_103", "cosmetics"),
  ("U_103", "children"),
  ("U_104", "groceries"),
  ("U_103", "food")
]
schema = StructType([
  StructField("user_id", StringType(), True),
  StructField("category", StringType(), True),
])
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("groupby").getOrCreate()
df = spark.createDataFrame(data, schema)
group_df = df.groupBy(f.col("user_id")).agg(
  f.concat_ws(",", f.collect_list(f.col("category"))).alias("categories")
)
group_df.show()
+-------+--------------------+
|user_id|          categories|
+-------+--------------------+
|  U_104|      food,groceries|
|  U_103|cosmetics,childre...|
+-------+--------------------+

There are some useful aggregation examples

Answered By: DevShepherd

Using Spark SQL this worked for me:

SELECT col1, col2, col3, REPLACE(REPLACE(CAST(collect_list(col4) AS string),"[",""),"]","")
FROM your_table
GROUP BY col1, col2, col3
Answered By: Konrad
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.