Find integer index of rows with NaN in pandas dataframe
Question:
I have a pandas DataFrame like this:
a b
2011-01-01 00:00:00 1.883381 -0.416629
2011-01-01 01:00:00 0.149948 -1.782170
2011-01-01 02:00:00 -0.407604 0.314168
2011-01-01 03:00:00 1.452354 NaN
2011-01-01 04:00:00 -1.224869 -0.947457
2011-01-01 05:00:00 0.498326 0.070416
2011-01-01 06:00:00 0.401665 NaN
2011-01-01 07:00:00 -0.019766 0.533641
2011-01-01 08:00:00 -1.101303 -1.408561
2011-01-01 09:00:00 1.671795 -0.764629
Is there an efficient way to find the “integer” index of rows with NaNs? In this case the desired output should be [3, 6].
Answers:
For DataFrame df:
import numpy as np
index = df['b'].index[df['b'].apply(np.isnan)]
will give you back the Index of row labels (a DatetimeIndex here, not a MultiIndex) that you can use to index back into df, e.g.:
df['a'].loc[index[0]]
>>> 1.452354
For the integer index:
df_index = df.index.tolist()
[df_index.index(i) for i in index]
>>> [3, 6]
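The list lookup above scans the index once per match. A shorter route to the same positions is sketched below, assuming the frame from the question is rebuilt by hand (the name `positions` is illustrative). `np.flatnonzero` returns the positions of the True entries of a boolean array.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: hourly timestamps, two float columns.
idx = pd.date_range("2011-01-01", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "a": [1.883381, 0.149948, -0.407604, 1.452354, -1.224869,
              0.498326, 0.401665, -0.019766, -1.101303, 1.671795],
        "b": [-0.416629, -1.782170, 0.314168, np.nan, -0.947457,
              0.070416, np.nan, 0.533641, -1.408561, -0.764629],
    },
    index=idx,
)

# Boolean mask -> NumPy array -> integer positions of the True entries.
positions = np.flatnonzero(df["b"].isna().to_numpy())
print(positions.tolist())
```

Because the mask is converted to a plain NumPy array first, the result does not depend on the type of the index.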
Here is a simpler solution:
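# Series.nonzero() was removed in pandas 1.0, so convert the mask to a NumPy array first
inds = pd.isnull(df).any(axis=1).to_numpy().nonzero()[0]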
In [9]: df
Out[9]:
0 1
0 0.450319 0.062595
1 -0.673058 0.156073
2 -0.871179 -0.118575
3 0.594188 NaN
4 -1.017903 -0.484744
5 0.860375 0.239265
6 -0.640070 NaN
7 -0.535802 1.632932
8 0.876523 -0.153634
9 -0.686914 0.131185
In [10]: pd.isnull(df).any(axis=1).to_numpy().nonzero()[0]
Out[10]: array([3, 6])
And just in case you want to find the coordinates of NaN across all the columns instead (supposing they are all numeric), here you go:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
df
0 1 2 3 4 5
0 0 1 3 4.0 NaN 2
1 3 5 6 NaN 3.0 3
np.where(np.asanyarray(np.isnan(df)))
(array([0, 1]), array([4, 3]))
Here is another simpler take:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
inds = np.asarray(df.isnull()).nonzero()
(array([0, 1], dtype=int64), array([4, 3], dtype=int64))
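The two arrays returned above are row positions and column positions. If labels are wanted instead, they can be looked up in `df.index` and `df.columns`, as in this minimal sketch (the names `rows`, `cols` and `pairs` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 3, 4, np.nan, 2], [3, 5, 6, np.nan, 3, 3]])

# Row and column positions of every NaN cell, in row-major order.
rows, cols = np.where(df.isna().to_numpy())

# Translate positions into (row label, column label) pairs.
pairs = list(zip(df.index[rows], df.columns[cols]))
print(pairs)
```

With the default integer index and columns used here, labels and positions coincide. The lookup matters once the frame has named columns or a datetime index.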
Don’t know if this is too late, but you can use np.where to find the indices of NaN values like so:
indices = list(np.where(df['b'].isna())[0])
I was looking for the integer indexes of all rows with NaN values in any column.
My working solution:
def get_nan_indexes(data_frame):
    indexes = []
    for column in data_frame:
        # Labels of the rows where this column is NaN
        index = data_frame[column].index[data_frame[column].isna()]
        indexes.extend(index)
    df_index = data_frame.index.tolist()
    # Map each distinct label to its integer position
    return sorted(df_index.index(i) for i in set(indexes))
One-line solution. However, it works for one column only and returns the index labels rather than integer positions:
df.loc[pd.isna(df["b"]), :].index
In case you have a datetime index and want its values:
df.loc[pd.isnull(df).any(axis=1), :].index.values
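The label-based answers above return timestamps, while the question asks for integer positions. One way to convert between the two is `Index.get_indexer`, sketched here on the question's data. It assumes the index has no duplicate labels, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "a": [1.883381, 0.149948, -0.407604, 1.452354, -1.224869,
              0.498326, 0.401665, -0.019766, -1.101303, 1.671795],
        "b": [-0.416629, -1.782170, 0.314168, np.nan, -0.947457,
              0.070416, np.nan, 0.533641, -1.408561, -0.764629],
    },
    index=idx,
)

# Labels (timestamps) of the rows that contain a NaN in any column.
labels = df.index[df.isna().any(axis=1)]

# Map those labels back to integer positions; requires a unique index.
positions = df.index.get_indexer(labels)
print(labels)
print(positions.tolist())
```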
Let the dataframe be named df and let the column of interest (i.e. the column in which we are looking for nulls) be ‘b’. Then the following snippet prints the integer index of each null in that column:
# Compute the mask once instead of on every iteration
is_null = df['b'].isnull()
for i in range(df.shape[0]):
    if is_null.iloc[i]:
        print(i)
Here are tests for a few methods:
%timeit np.where(np.isnan(df['b']))[0]
%timeit pd.isnull(df['b']).nonzero()[0]
%timeit np.where(df['b'].isna())[0]
%timeit df.loc[pd.isna(df['b']), :].index
And their corresponding timings:
333 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
280 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
313 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.84 ms ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
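It would appear that pd.isnull(df['b']).nonzero()[0] wins the day in terms of timing, but the top three methods have comparable performance. Note that Series.nonzero() was removed in pandas 1.0, so on current versions that variant has to be written as pd.isnull(df['b']).to_numpy().nonzero()[0].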
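Before comparing speed it is worth confirming that the fast variants return the same positions. The sketch below does that on a small made-up Series (the values are illustrative, not the data that was timed), using forms that run on current pandas:

```python
import numpy as np
import pandas as pd

# Small illustrative column with NaN at positions 1 and 3.
s = pd.Series([1.0, np.nan, 2.0, np.nan, 5.0])

via_np_isnan = np.where(np.isnan(s))[0]
via_nonzero = pd.isnull(s).to_numpy().nonzero()[0]
via_isna = np.where(s.isna())[0]

# All three variants should yield the same integer positions.
print(via_np_isnan.tolist(), via_nonzero.tolist(), via_isna.tolist())
```

One difference to keep in mind: `np.isnan` only accepts numeric data, whereas `isna` and `isnull` also handle object and datetime columns.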
Another simple solution is list(np.where(df['b'].isnull())[0])
This will give you the index values of the rows that have a NaN in any column:
df.loc[pd.isna(df).any(axis=1), :].index
index_nan = []
# isna() gives a boolean Series; items() yields (index label, value) pairs
for index, bool_v in df["b"].isna().items():
    if bool_v:
        index_nan.append(index)
print(index_nan)