First non-null value per row from a list of Pandas columns

Question:

If I’ve got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How can I get the first non-null value from each row? E.g. for the above, I’d like to get: [1, 3, 4, None] (or equivalent Series).

Asked By: Dave Challis

||

Answers:

This is a really messy way to do this, first use first_valid_index to get the valid columns, convert the returned series to a dataframe so we can call apply row-wise and use this to index back to original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
​
Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64
Answered By: EdChum

Here is a one line solution:

[row[row.first_valid_index()] if row.first_valid_index() else None for _, row in df.iterrows()]

Edit:

This solution iterates over rows of df. row.first_valid_index() returns label for first non-NA/null value, which will be used as index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() would be None, thus cannot be used as index, so I need a if-else statement.

I packed everything into a list comprehension for brevity.

Answered By: yangjie

This is nothing new, but it’s a combination of the best bits of @yangie’s approach with a list comprehension, and @EdChum’s df.apply approach that I think is easiest to understand.

First, which columns to we want to pick our values from?

In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
Out[96]: 
0       A
1       B
2       B
3    None
dtype: object

Now how do we pick the values?

In [100]: [df.loc[k, v] if v is not None else None 
    ....:     for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]

This is ok, but we really want the index to match that of the original DataFrame:

In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
   ....:     for k, v in pick_cols.iteritems()})
Out[98]: 
0     1
1     3
2     4
3   NaN
dtype: float64
Answered By: LondonRob

Here is another way to do it:

In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]: 
0     1
1     3
2     4
3   NaN
dtype: float64

The idea here is to use stack to move the columns into a row index level:

In [184]: df.stack()
Out[184]: 
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64

Now, if you group by the first row level — i.e. the original index — and take the first value from each group, you essentially get the desired result:

In [185]: df.stack().groupby(level=0).first()
Out[185]: 
0    1
1    3
2    4
dtype: float64

All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:

df.stack().groupby(level=0).first().reindex(df.index)
Answered By: unutbu

I’m going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:

def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]

EDIT:
Here’s a fully vectorized solution which is can be a good deal faster again depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]

If a row is completely null then the corresponding value will be null also.
Here’s some benchmarking against unutbu’s solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:


df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Answered By: JoeCondron

JoeCondron’s answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

… but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop

Compared to JoeCondron’s vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

Answered By: Pietro Battiston

Fill the nans from the left with fillna, then get the leftmost column:

df.fillna(method='bfill', axis=1).iloc[:, 0]
Answered By: Andy Jones

groupby in axis=1

If we pass a callable that returns the same value, we group all columns together. This allows us to use groupby.agg which gives us the first method that makes this easy

df.groupby(lambda x: 'Z', 1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN

This returns a dataframe with the column name of the thing I was returning in my callable


lookup, notna, and idxmax

df.lookup(df.index, df.notna().idxmax(1))

array([ 1.,  3.,  4., nan])

argmin and slicing

v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]

array([ 1.,  3.,  4., nan])
Answered By: piRSquared
df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})

df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN

df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
Answered By: bhamu
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.