How do I select the group with the least number of null values in a groupby?
Question:
Example:
row_number | id | firstname | middlename | lastname
0          | 1  | John      | NULL       | Doe
1          | 1  | John      | Jacob      | Doe
2          | 2  | Alison    | Marie      | Smith
3          | 2  | NULL      | Marie      | Smith
4          | 2  | Alison    | Marie      | Smith
I'm trying to figure out how to group by id and then, for each group, grab the row with the fewest NULL values. If several rows tie for the fewest NULLs, dropping the extras is fine (for example, dropping row_number 4 because it ties row_number 2 for the fewest NULLs where id=2).
The answer for this example would be row_numbers 1 and 2.
ANSI SQL would be preferable, but I can translate from other languages (such as Python with pandas) if you can think of a way to do it.
Edit:
Added a row for the case of tie-breaking.
Answers:
Oh, you want the rows with the fewest null values. I would suggest:
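select t.*
from (select t.*,
             -- rank rows within each id by how many name columns are null;
             -- dense_rank() keeps ties, use row_number() for exactly one row per id
             dense_rank() over (partition by id
                                order by (case when firstname is null then 1 else 0 end) +
                                         (case when middlename is null then 1 else 0 end) +
                                         (case when lastname is null then 1 else 0 end)
                               ) as seqnum
      from t
     ) t
where seqnum = 1;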
This is ANSI-standard SQL.
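As a quick sanity check, here is a minimal sketch that runs a variant of the query above against the example rows using Python's built-in sqlite3 module. The variant partitions by id and swaps in row_number() for dense_rank() so that ties are dropped. Assumptions that are mine and not from the question: the table is named t, the row_number column is renamed row_num to avoid clashing with the function name, row_num is used as the tie-breaker, and the bundled SQLite is 3.25 or newer (needed for window functions).

```python
import sqlite3

# Example rows from the question; None stands in for NULL.
rows = [
    (0, 1, "John", None, "Doe"),
    (1, 1, "John", "Jacob", "Doe"),
    (2, 2, "Alison", "Marie", "Smith"),
    (3, 2, None, "Marie", "Smith"),
    (4, 2, "Alison", "Marie", "Smith"),
]

# row_number() numbers the rows inside each id, fewest NULLs first;
# row_num breaks ties so the result is deterministic.
QUERY = """
select row_num, id, firstname, middlename, lastname
from (select t.*,
             row_number() over (
                 partition by id
                 order by (case when firstname is null then 1 else 0 end) +
                          (case when middlename is null then 1 else 0 end) +
                          (case when lastname is null then 1 else 0 end),
                          row_num
             ) as seqnum
      from t
     ) t
where seqnum = 1
order by id
"""


def fewest_nulls_per_id(data):
    """Load the rows into an in-memory table and run the windowed query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table t (row_num integer, id integer, "
            "firstname text, middlename text, lastname text)"
        )
        conn.executemany("insert into t values (?, ?, ?, ?, ?)", data)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = fewest_nulls_per_id(rows)
print(result)
```

With the five example rows this keeps row 1 for id 1 and row 2 for id 2; row 4 ties row 2 but loses on the row_num tie-breaker.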
If you want to do this in pandas, you can do it this way:
df[df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform(lambda x: x == x.min())]
Output:
   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith
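To make the behaviour on ties concrete, here is a small self-contained sketch of the same "compare each row to its group minimum" idea, run on the question's five rows. The DataFrame construction and the variable names (null_count, group_min, tied) are my own, and it assumes the NULLs are real missing values (None/NaN) rather than the string 'NULL'.

```python
import pandas as pd

# The question's example data, with None standing in for NULL.
df = pd.DataFrame({
    "row_number": [0, 1, 2, 3, 4],
    "id": [1, 1, 2, 2, 2],
    "firstname": ["John", "John", "Alison", None, "Alison"],
    "middlename": [None, "Jacob", "Marie", "Marie", "Marie"],
    "lastname": ["Doe", "Doe", "Smith", "Smith", "Smith"],
})

# Number of missing values in each row.
null_count = df.isna().sum(axis=1)

# Smallest null count within each id, broadcast back to every row.
group_min = null_count.groupby(df["id"]).transform("min")

# Keep every row that matches its group's minimum (ties included).
tied = df[null_count == group_min]
print(tied["row_number"].tolist())  # [1, 2, 4]
```

Because the mask keeps every row that reaches the group minimum, both row 2 and row 4 survive for id 2, so an extra step is needed if you want exactly one row per id.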
For the tiebreaker, add a row:
df.loc[4, ['row_number', 'id', 'firstname', 'middlename', 'lastname']] = [4, 2, 'Mary', 'Maxine', 'Maxwell']
Then use groupby, transform, and idxmin:
df[df.index == df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform('idxmin')]
Output:
   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith
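The same tie-breaking idea can be written a little more directly by calling idxmin on the grouped null counts and passing the resulting labels to df.loc. This is only a sketch on the question's data, not the only way to do it; the variable names are mine, and it relies on idxmin returning the first label when several rows share the minimum.

```python
import pandas as pd

# The question's example data, with None standing in for NULL.
df = pd.DataFrame({
    "row_number": [0, 1, 2, 3, 4],
    "id": [1, 1, 2, 2, 2],
    "firstname": ["John", "John", "Alison", None, "Alison"],
    "middlename": [None, "Jacob", "Marie", "Marie", "Marie"],
    "lastname": ["Doe", "Doe", "Smith", "Smith", "Smith"],
})

# Number of missing values in each row.
null_count = df.isna().sum(axis=1)

# For each id, the index label of the first row with the fewest nulls.
first_best = null_count.groupby(df["id"]).idxmin()

# Select exactly one row per id.
one_per_id = df.loc[first_best]
print(one_per_id["row_number"].tolist())  # [1, 2]
```

Row 4 ties row 2 for id 2, but row 2 comes first in the frame, so it is the one that is kept. If you need a different winner among ties, sort the frame before counting nulls.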