pandas Finding IDs that only have NAs
Question:
I am looking to get a series/DataFrame of IDs that have only NAs within the data.
The data looks something like this:
ID
Score
Date
1
95
1-1-2022
1
nan
1-1-2022
1
nan
2-1-2022
2
nan
1-1-2022
2
100
2-1-2022
3
nan
1-1-2022
3
nan
1-1-2022
3
nan
2-1-2022
So in this case, I would only want to grab ID 3 and have one instance of it in the newly created DF.
So far, I’ve tried:
df = pd.DataFrame({'ID':[1,1,1,2,2,3,3,3],
'Score':[95,nan,nan,nan,100,nan,nan,nan],
'Date':['1-1-2022','1-1-2022','2-1-2022','1-1-2022','2-1-2022',
'1-1-2022','1-1-2022', '2-1-2022']})
df2 = pd.DataFrame(df.groupby('ID',dropna=False)['Score'].count())
df3 = df2[df2['Score'] == 0]
But this does not seem to work. Thanks for your help.
Answers:
You could create a boolean vector checking if ‘Score’ column is nan
and then you can groupby your ‘ID’ column and use transform('all')
.
This will return True
/ False
to all rows of the ID that are nan
in all rows.
>>> df['Score'].isna().groupby(df['ID']).transform('all')
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
Name: Score, dtype: bool
Using this boolean you can filter your dataframe and get ID 3 in a new df:
ids_with_all_none = df[df['Score'].isna().groupby(df['ID']).transform('all')]
ID Score Date
5 3 NaN 1-1-2022
6 3 NaN 1-1-2022
7 3 NaN 2-1-2022
It is going to work fine if you replace nan
with None
. Try this out:
df = pd.DataFrame({'ID':[1,1,1,2,2,3,3,3],
'Score':[95,None,None,None,100,None,None,None],
'Date':['1-1-2022','1-1-2022','2-1-2022','1-1-2022','2-1-2022',
'1-1-2022','1-1-2022', '2-1-2022']})
df2 = pd.DataFrame(df.groupby('ID',dropna=False, as_index=False)['Score'].count())
df3 = df2[df2['Score'] == 0]
print(df3['ID'])
I am looking to get a series/DataFrame of IDs that have only NAs within the data.
The data looks something like this:
ID | Score | Date |
---|---|---|
1 | 95 | 1-1-2022 |
1 | nan | 1-1-2022 |
1 | nan | 2-1-2022 |
2 | nan | 1-1-2022 |
2 | 100 | 2-1-2022 |
3 | nan | 1-1-2022 |
3 | nan | 1-1-2022 |
3 | nan | 2-1-2022 |
So in this case, I would only want to grab ID 3 and have one instance of it in the newly created DF.
So far, I’ve tried:
df = pd.DataFrame({'ID':[1,1,1,2,2,3,3,3],
'Score':[95,nan,nan,nan,100,nan,nan,nan],
'Date':['1-1-2022','1-1-2022','2-1-2022','1-1-2022','2-1-2022',
'1-1-2022','1-1-2022', '2-1-2022']})
df2 = pd.DataFrame(df.groupby('ID',dropna=False)['Score'].count())
df3 = df2[df2['Score'] == 0]
But this does not seem to work. Thanks for your help.
You could create a boolean vector checking if ‘Score’ column is nan
and then you can groupby your ‘ID’ column and use transform('all')
.
This will return True
/ False
to all rows of the ID that are nan
in all rows.
>>> df['Score'].isna().groupby(df['ID']).transform('all')
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
Name: Score, dtype: bool
Using this boolean you can filter your dataframe and get ID 3 in a new df:
ids_with_all_none = df[df['Score'].isna().groupby(df['ID']).transform('all')]
ID Score Date
5 3 NaN 1-1-2022
6 3 NaN 1-1-2022
7 3 NaN 2-1-2022
It is going to work fine if you replace nan
with None
. Try this out:
df = pd.DataFrame({'ID':[1,1,1,2,2,3,3,3],
'Score':[95,None,None,None,100,None,None,None],
'Date':['1-1-2022','1-1-2022','2-1-2022','1-1-2022','2-1-2022',
'1-1-2022','1-1-2022', '2-1-2022']})
df2 = pd.DataFrame(df.groupby('ID',dropna=False, as_index=False)['Score'].count())
df3 = df2[df2['Score'] == 0]
print(df3['ID'])