How do I select the group with the least number of null values in a groupby?

Question

Example:

row_number | id | firstname | middlename | lastname |
0          | 1  | John      | NULL       | Doe      |
1          | 1  | John      | Jacob      | Doe      |
2          | 2  | Alison    | Marie      | Smith    |
3          | 2  | NULL      | Marie      | Smith    |
4          | 2  | Alison    | Marie      | Smith    |

I’m trying to figure out how to group by id and then keep the row with the fewest NULL values in each group. Dropping extra rows that tie for the fewest NULLs is fine (for example, dropping row_number 4, since it ties row_number 2 for the fewest NULLs where id = 2).

The answer for this example would be row_numbers 1 and 2.

Preferably this would be ANSI SQL, but I can translate from other languages (like Python with pandas) if you can think of a way to do it.

Edit:
Added a row for the case of tie-breaking.

Asked By: Myles Hollowed

Answer 1

Oh, you want the rows with the fewest null values. I would suggest:

select t.*
from (select t.*,
             -- partition by id so the ranking restarts for each id
             dense_rank() over (partition by id
                                order by (case when firstname is null then 1 else 0 end) +
                                         (case when middlename is null then 1 else 0 end) +
                                         (case when lastname is null then 1 else 0 end)
                               ) as seqnum
      from t
     ) t
where seqnum = 1;

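This is ANSI-standard SQL. Note that dense_rank() returns every row that ties for the fewest NULLs; to keep exactly one row per id, use row_number() instead.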
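As a rough way to try this kind of query on the question's sample data, here is a minimal sketch that uses Python's built-in sqlite3 module. It is a variant, not the answer's exact query: it assumes the ranking should restart for each id (partition by id) and uses row_number() with the row_number column as a tie-breaker so that one row per id survives. The table name t matches the answer; the in-memory database and the Python variable names are illustrative assumptions, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows from the question; None is stored as SQL NULL.
rows = [
    (0, 1, "John", None, "Doe"),
    (1, 1, "John", "Jacob", "Doe"),
    (2, 2, "Alison", "Marie", "Smith"),
    (3, 2, None, "Marie", "Smith"),
    (4, 2, "Alison", "Marie", "Smith"),
]

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (row_number integer, id integer, "
    "firstname text, middlename text, lastname text)"
)
conn.executemany("insert into t values (?, ?, ?, ?, ?)", rows)

# Rank rows within each id by their NULL count; the row_number column
# breaks ties, so exactly one row per id gets seqnum = 1.
query = """
select row_number, id, firstname, middlename, lastname
from (select t.*,
             row_number() over (
                 partition by id
                 order by (case when firstname is null then 1 else 0 end) +
                          (case when middlename is null then 1 else 0 end) +
                          (case when lastname is null then 1 else 0 end),
                          row_number
             ) as seqnum
      from t
     ) t
where seqnum = 1
order by id
"""
selected = conn.execute(query).fetchall()
print(selected)
```

On the sample data this should keep row_numbers 1 and 2, which matches the result the question asks for.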

Answered By: Gordon Linoff

Answer 2

If you want to do this in pandas, you can do it this way:

df[df.assign(NC = df.isnull().sum(1)).groupby('id')['NC'].transform(lambda x: x == x.min())]

Output:

   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith

To break ties:

Add a row:

df.loc[4,['row_number','id','firstname','middlename','lastname']] = ['4',2,'Mary','Maxine','Maxwell']

Then use groupby, transform, and idxmin:

df[df.index == df.assign(NC = df.isnull().sum(1)).groupby('id')['NC'].transform('idxmin')]

Output:

  row_number id firstname middlename lastname
1          1  1      John      Jacob      Doe
2          2  2    Alison      Marie    Smith

Answered By: Scott Boston
