How to group and filter a dataframe in PySpark
Question:
Attached is my dataframe and I’m trying to find a way in pySpark to filter Link Name for which none of the Supports are with Status ‘In’. For eg. The expected output should be only Link3 as none of the supports associated to it are with ‘In’
Link Name
Support
Status
Link1
Support1
In
Link1
Support2
In
Link1
Support3
Out
Link2
Support4
In
Link2
Support5
In
Link3
Support6
Out
Link3
Support7
Out
Can someone please help me here?
The expected output should be only Link3 as none of the supports associated to it are without ‘In’
Answers:
You can try something like this with window function
import pyspark.sql.functions as F
from pyspark.sql import Window
inputData = [
("Link1", "Support1", "In"),
("Link1", "Support2", "In"),
("Link1", "Support3", "Out"),
("Link2", "Support4", "In"),
("Link2", "Support5", "In"),
("Link3", "Support6", "Out"),
("Link3", "Support7", "Out"),
]
inputDf = spark.createDataFrame(inputData, schema=["Link Name", "Support", "Status"])
window = Window.partitionBy("Link Name").orderBy(F.col("Status").asc())
dfWithRank = inputDf.withColumn("dense_rank", F.dense_rank().over(window))
dfWithRank.filter(
(F.col("dense_rank") == F.lit(1)) & (F.col("Status") == F.lit("Out"))
).select("Link Name").distinct().show()
I am grouping by link name and sorting by status within group. If first status within group sorted ascending is "Out" it means that "In" status does not exists for such partition and thats what filter is checking
At the end i am selecting only Link Name and calling distinct to get just single record with Link Name
Output is
+---------+
|Link Name|
+---------+
| Link3|
+---------+
Attached is my dataframe and I’m trying to find a way in pySpark to filter Link Name for which none of the Supports are with Status ‘In’. For eg. The expected output should be only Link3 as none of the supports associated to it are with ‘In’
Link Name | Support | Status |
---|---|---|
Link1 | Support1 | In |
Link1 | Support2 | In |
Link1 | Support3 | Out |
Link2 | Support4 | In |
Link2 | Support5 | In |
Link3 | Support6 | Out |
Link3 | Support7 | Out |
Can someone please help me here?
The expected output should be only Link3 as none of the supports associated to it are without ‘In’
You can try something like this with window function
import pyspark.sql.functions as F
from pyspark.sql import Window
inputData = [
("Link1", "Support1", "In"),
("Link1", "Support2", "In"),
("Link1", "Support3", "Out"),
("Link2", "Support4", "In"),
("Link2", "Support5", "In"),
("Link3", "Support6", "Out"),
("Link3", "Support7", "Out"),
]
inputDf = spark.createDataFrame(inputData, schema=["Link Name", "Support", "Status"])
window = Window.partitionBy("Link Name").orderBy(F.col("Status").asc())
dfWithRank = inputDf.withColumn("dense_rank", F.dense_rank().over(window))
dfWithRank.filter(
(F.col("dense_rank") == F.lit(1)) & (F.col("Status") == F.lit("Out"))
).select("Link Name").distinct().show()
I am grouping by link name and sorting by status within group. If first status within group sorted ascending is "Out" it means that "In" status does not exists for such partition and thats what filter is checking
At the end i am selecting only Link Name and calling distinct to get just single record with Link Name
Output is
+---------+
|Link Name|
+---------+
| Link3|
+---------+