Fast way to get index of non-blank values in row/column

Question:

Let’s say we have the following pandas dataframe:

df = pd.DataFrame({'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: None, 2: 8.0}, 'c': {0: 4.0, 1: 2.0, 2: 6.0}})

     a     b    c
0  3.0  10.0  4.0
1  2.0   NaN  2.0
2  NaN   8.0  6.0

I need to get a dataframe with, for each row, the column names of all non-NaN values.
I know I can do the following, which produces the expected output:

df2 = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)

   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN

Unfortunately, this is quite slow with large datasets. Is there a faster way?

Getting the row indices of non-Null values of each column could work too, as I would just need to transpose the input dataframe. Thanks

Asked By: younggotti


Answers:

Use NumPy:

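import numpy as np

m = df.notna()  # True where a value is present
# column label where a value is present, NaN elsewhere
a = m.mul(df.columns).where(m).to_numpy()
# a stable sort moves present cells to the left and keeps their column order
out = pd.DataFrame(a[np.arange(len(a))[:,None], np.argsort(~m.to_numpy(), axis=1, kind='stable')],
                   index=df.index)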

Output:

   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN
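
To make the idea easier to follow, here is a minimal sketch of the same technique written with plain NumPy calls. It is not the answerer's exact code: the helper name non_nan_labels is hypothetical, and np.where plus np.take_along_axis are swapped in for the mul/where and fancy-indexing steps. The key point is that False sorts before True, so a stable argsort of the inverted mask lists the non-NaN positions first, in their original column order.

```python
import numpy as np
import pandas as pd


def non_nan_labels(df):
    """Hypothetical helper: per row, the labels of non-NaN columns, left-aligned."""
    mask = df.notna().to_numpy()
    # Broadcast the column labels across all rows; NaN where the cell is missing.
    labels = np.where(mask, df.columns.to_numpy(dtype=object), np.nan)
    # False sorts before True, so sorting ~mask puts the valid cells first;
    # kind="stable" preserves the original column order among them.
    order = np.argsort(~mask, axis=1, kind="stable")
    # Reorder each row of labels by its own permutation.
    return pd.DataFrame(np.take_along_axis(labels, order, axis=1), index=df.index)


df = pd.DataFrame({"a": [3.0, 2.0, None],
                   "b": [10.0, None, 8.0],
                   "c": [4.0, 2.0, 6.0]})
out = non_nan_labels(df)
print(out)
```

For the row indices of non-NaN values per column, the same helper can presumably be applied to df.T, as suggested in the question.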

Timings:

On 30k rows x 3 columns:

# numpy approach
6.82 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas apply
7.32 s ± 553 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
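
A comparison like this could be reproduced with a sketch along the following lines. The test data is an assumption (random values with roughly 30% NaN, and only 2,000 rows so the apply version finishes quickly), the function names with_apply and with_numpy are hypothetical, and the measured times will depend on the machine and the pandas/NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd


def with_apply(df):
    # Row-wise approach from the question.
    return df.apply(lambda x: pd.Series(x.dropna().index), axis=1)


def with_numpy(df):
    # Vectorized sketch: stable-sort each row's inverted mask so valid labels come first.
    mask = df.notna().to_numpy()
    labels = np.where(mask, df.columns.to_numpy(dtype=object), np.nan)
    order = np.argsort(~mask, axis=1, kind="stable")
    return pd.DataFrame(np.take_along_axis(labels, order, axis=1), index=df.index)


# Assumed test data: 2,000 rows x 3 columns with roughly 30% NaN.
rng = np.random.default_rng(0)
values = rng.random((2_000, 3))
values[rng.random(values.shape) < 0.3] = np.nan
big = pd.DataFrame(values, columns=list("abc"))

# Time a single call of each approach.
t_apply = timeit.timeit(lambda: with_apply(big), number=1)
t_numpy = timeit.timeit(lambda: with_numpy(big), number=1)
print(f"apply: {t_apply:.4f} s, numpy: {t_numpy:.4f} s")
```
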
Answered By: mozway