PySpark Dataframe Groupby and Count Null Values
Question:
I have a Spark Dataframe of the following form:
+------+-------+-----+--------+
| Year | Month | Day | Ticker |
+------+-------+-----+--------+
I am trying to group all of the values by “year” and count the number of missing values in each column per year.
I found the following snippet (forgot where from):
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
This works perfectly when calculating the number of missing values per column. However, I’m not sure how I would modify this to calculate the missing values per year.
Any pointers in the right direction would be much appreciated.
Answers:
You can use the same logic and add a groupBy. Note that I also removed "year" from the aggregated columns, but that's optional (if you keep it, you get two "year" columns).
from pyspark.sql.functions import col, sum

columns = [c for c in df.columns if c != "year"]
(df.groupBy("year")
   .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
   .show())
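For reference, here is a minimal self-contained sketch of the same approach on a toy DataFrame; the column names and sample values are made up for illustration only:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

# Illustrative data with a few missing values.
df = spark.createDataFrame(
    [(2020, 1, 1, "AAPL"),
     (2020, 1, None, None),
     (2021, None, 2, "MSFT")],
    ["year", "month", "day", "ticker"],
)

columns = [c for c in df.columns if c != "year"]

# One row per year; each cell holds the count of nulls in that column for that year.
(df.groupBy("year")
   .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
   .show())

# Expected output for this toy data:
# +----+-----+---+------+
# |year|month|day|ticker|
# +----+-----+---+------+
# |2020|    0|  1|     1|
# |2021|    1|  0|     0|
# +----+-----+---+------+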