PySpark Dataframe Groupby and Count Null Values


I have a Spark Dataframe of the following form:

| Year | Month | Day | Ticker |

I am trying to group all of the values by “year” and count the number of missing values in each column per year.

I found the following snippet (forgot where from):

df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

This works perfectly when calculating the number of missing values per column. However, I’m not sure how I would modify this to calculate the missing values per year.

Any pointers in the right direction would be much appreciated.

Asked By: user10691834



You can use the same logic and add a groupBy. Note that I also removed "year" from the aggregated columns. That’s optional, but if you keep it you get two ‘year’ columns: the grouping key and its null count.

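from pyspark.sql.functions import col, sum  # note: this sum shadows Python's built-in sum
columns = [c for c in df.columns if c != "year"]  # every column except the grouping key
df.groupBy("year") \
  .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))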
Answered By: Oli