PySpark DataFrame GroupBy and Count Null Values

Question:

I have a Spark DataFrame of the following form:

+------+-------+-----+--------+
| Year | Month | Day | Ticker |
+------+-------+-----+--------+

I am trying to group all of the values by “year” and count the number of missing values in each column per year.

I found the following snippet (forgot where from):

from pyspark.sql.functions import col, sum

df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

This works perfectly when calculating the number of missing values per column. However, I’m not sure how I would modify this to calculate the missing values per year.

Any pointers in the right direction would be much appreciated.

Asked By: user10691834


Answers:

You can use the same logic and add a groupBy. Note that I also removed "Year" from the aggregated columns; that's optional, but keeping it would give you two "Year" columns in the output.

from pyspark.sql.functions import col, sum

# Exclude the grouping column (case-insensitive) so it is not aggregated twice
columns = [c for c in df.columns if c.lower() != "year"]

(df.groupBy("Year")
   .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
   .show())
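
For reference, here is a minimal, self-contained sketch of the approach above, assuming a local SparkSession and a small made-up dataset (the data and the output shown in the comments are illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical toy data: a few nulls scattered across the non-grouping columns
df = spark.createDataFrame(
    [
        (2018, 1, None, "AAPL"),
        (2018, None, 2, "MSFT"),
        (2019, 3, 3, None),
        (2019, 4, None, None),
    ],
    ["Year", "Month", "Day", "Ticker"],
)

columns = [c for c in df.columns if c.lower() != "year"]

(df.groupBy("Year")
   .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
   .show())

# Expected null counts per year for the toy data (row order may vary):
# +----+-----+---+------+
# |Year|Month|Day|Ticker|
# +----+-----+---+------+
# |2018|    1|  1|     0|
# |2019|    0|  1|     2|
# +----+-----+---+------+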
Answered By: Oli