Pandas keep the most complete rows

Question:

Let's say I have a dataframe that has a lot of missing data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a','a','b','b','b','c','d','e','e','e'],
                   'q1': [1,1,np.nan,np.nan,0,np.nan,1,np.nan,1,0],
                   'q2': ['low',np.nan,np.nan,'high','low','high','high',np.nan,np.nan,'low'],
                   'q3': [np.nan,1,np.nan,1,0,0,1,0,np.nan,np.nan]
                   })

Which looks like this:

  id   q1    q2   q3
0  a  1.0   low  NaN
1  a  1.0   NaN  1.0
2  b  NaN   NaN  NaN
3  b  NaN  high  1.0
4  b  0.0   low  0.0
5  c  NaN  high  0.0
6  d  1.0  high  1.0
7  e  NaN   NaN  0.0
8  e  1.0   NaN  NaN
9  e  0.0   low  NaN

I want to create a new dataframe that contains only one row per id: the most complete row (the one with the fewest NaN values). If several rows are equally complete, I want the first occurrence in the current sort order.

Ideal output is a new dataframe:

  id   q1    q2   q3
0  a  1.0   low  NaN
1  b  0.0   low  0.0
2  c  NaN  high  0.0
3  d  1.0  high  1.0
4  e  0.0   low  NaN

I can count the number of NaN values in each row using df.isnull().sum(axis=1), but I'm not sure how to use that to select the row with the smallest count, especially when there are more than two entries for an id.

Asked By: Simon


Answers:

You could add a helper column holding the per-row NaN count, sort by id and that count, and then keep the first row of each group with groupby.

# Parentheses are needed so the method chain can span several lines.
df = (df.assign(count=df.isnull().sum(axis=1))
        .sort_values(['id', 'count'])
        .groupby('id', as_index=False).head(1)
        .drop(columns='count'))

print(df)
  id   q1    q2   q3
0  a  1.0   low  NaN
4  b  0.0   low  0.0
5  c  NaN  high  0.0
6  d  1.0  high  1.0
9  e  0.0   low  NaN
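
As a variation on the same idea, here is a minimal sketch that skips the helper column and asks each group for the index label of its smallest NaN count. It assumes the question's sample frame and a unique index; the names na_count, best_idx and most_complete are illustrative only. idxmin returns the first label when there is a tie, which should match the "first occurrence" requirement.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e', 'e', 'e'],
    'q1': [1, 1, np.nan, np.nan, 0, np.nan, 1, np.nan, 1, 0],
    'q2': ['low', np.nan, np.nan, 'high', 'low', 'high', 'high', np.nan, np.nan, 'low'],
    'q3': [np.nan, 1, np.nan, 1, 0, 0, 1, 0, np.nan, np.nan],
})

# Number of missing values in each row.
na_count = df.isna().sum(axis=1)

# For each id, the index label of the first row with the fewest NaNs.
best_idx = na_count.groupby(df['id']).idxmin()

# Select those rows from the original frame (requires a unique index).
most_complete = df.loc[best_idx]
print(most_complete)
```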
Answered By: cs95

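I would use drop_duplicates for this: count the NaN values per row, sort by id and that count, and keep the first row for each id. You can remove the helper column afterwards with .drop(columns='Notnullvalue'). (Despite its name, Notnullvalue holds the number of null values in each row.)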

df['Notnullvalue'] = df.isnull().sum(axis=1)
df.sort_values(['id', 'Notnullvalue']).drop_duplicates(['id'], keep='first')
Out[15]: 
  id   q1    q2   q3  Notnullvalue
0  a  1.0   low  NaN             1
4  b  0.0   low  0.0             0
5  c  NaN  high  0.0             1
6  d  1.0  high  1.0             0
9  e  0.0   low  NaN             1
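
A possible end-to-end sketch of this approach, assuming the question's sample frame, that leaves the original df untouched and removes the helper column at the end. The column name n_missing is made up for the example. Sorting on the count alone with kind='stable' keeps equally complete rows in their original order, and the final sort_index restores the original row order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e', 'e', 'e'],
    'q1': [1, 1, np.nan, np.nan, 0, np.nan, 1, np.nan, 1, 0],
    'q2': ['low', np.nan, np.nan, 'high', 'low', 'high', 'high', np.nan, np.nan, 'low'],
    'q3': [np.nan, 1, np.nan, 1, 0, 0, 1, 0, np.nan, np.nan],
})

result = (
    df.assign(n_missing=df.isna().sum(axis=1))   # helper column (hypothetical name)
      .sort_values('n_missing', kind='stable')   # ties keep their original order
      .drop_duplicates('id', keep='first')       # first row per id = fewest NaNs
      .drop(columns='n_missing')                 # remove the helper column
      .sort_index()                              # back to the original row order
)
print(result)
```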
Answered By: BENY

Inspired by @COLDSPEED, here is another solution. Note that na_position='last' is already the default in sort_values, so passing it explicitly is optional.

df.sort_values(by=['q1','q2','q3'], na_position='last').groupby('id').head(1).sort_index()
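
One caveat that may be worth checking on your own data: this sorts by the values themselves rather than by the number of NaNs, so a row with a small q1 can come out ahead of a more complete row. The sketch below uses a made-up two-row frame (demo is not from the question) to compare the value-based sort with a NaN-count-based selection.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has two NaNs but a smaller q1.
demo = pd.DataFrame({
    'id': ['x', 'x'],
    'q1': [1, 0],
    'q2': ['low', np.nan],
    'q3': [1, np.nan],
})

# Value-based sort: the smallest q1 wins, regardless of completeness.
by_value = (demo.sort_values(by=['q1', 'q2', 'q3'], na_position='last')
                .groupby('id').head(1))

# Count-based selection: the row with the fewest NaNs wins.
na_count = demo.isna().sum(axis=1)
by_count = demo.loc[na_count.groupby(demo['id']).idxmin()]

print(by_value.index.tolist())  # [1], the less complete row
print(by_count.index.tolist())  # [0], the complete row
```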
Answered By: luoshao23