Get distinct count of values in single row in Pyspark DataFrame

Question:

I’m trying to separate the comma-delimited values in a single PySpark DataFrame column so that it returns the individual values and their counts.

The data I have is formatted as such:

+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+

And I want the result to look like

+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...

The code I’ve tried that hasn’t worked is below:

data.select('tags').groupby('tags').count().show(10)

I also tried the countDistinct function, which did not work either.

I feel like I need a function that splits the values on commas and then lists them individually, but I’m not sure how to do that.

Asked By: slycooper


Answers:

You can use split() to split the strings into arrays of tags, then explode() to turn each tag into its own row. Finally, group by and count:

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriler,horror,comedy"],
], schema=["tags"])

df = (
    df
    .withColumn("tags", F.split("tags", pattern=","))
    .withColumn("tags", F.explode("tags"))
)

df = df.groupBy("tags").count()
df.show(truncate=False)

[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriler |1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+
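Note that the tags in the question are separated by a comma and a space ("cult, horror, ..."), so splitting on a bare "," leaves a leading space on each token; either trim each piece (e.g. with F.trim) or split on the regex ",\s*". The same split, trim, and count idea can be sketched in plain Python to show the logic (the sample rows below are illustrative, not the question's actual data):

```python
from collections import Counter

# Illustrative rows shaped like the question's data (comma-space separated)
rows = [
    "cult, horror, gothic",
    "violence",
    "romantic",
    "romantic, queer",
]

# Split each row on commas, strip surrounding whitespace, and count every tag
tag_counts = Counter(
    tag.strip()
    for row in rows
    for tag in row.split(",")
)

print(tag_counts["romantic"])  # 2: "romantic" appears in two rows
```

In Spark this trimming step matters for the same reason: without it, " horror" and "horror" would be counted as two distinct tags.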
Answered By: Azhar Khan