Fast way to get index of non-blank values in row/column
Question:
Let’s say we have the following pandas dataframe:
df = pd.DataFrame({'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: None, 2: 8.0}, 'c': {0: 4.0, 1: 2.0, 2: 6.0}})
     a     b    c
0  3.0  10.0  4.0
1  2.0   NaN  2.0
2  NaN   8.0  6.0
I need to get a dataframe with, for each row, the column names of all non-NaN values.
I know I can do the following, which produces the expected output:
df2 = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)
   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN
Unfortunately, this is quite slow with large datasets. Is there a faster way?
Getting the row indices of the non-null values of each column would work too, since I could just transpose the input dataframe. Thanks.
Answers:
Use numpy:
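import numpy as np
m = df.notna()  # True where a value is present
# column name where a value is present, NaN elsewhere
a = m.mul(df.columns).where(m).to_numpy()
# stable argsort of ~m moves present cells to the front of each row, keeping column order
order = np.argsort(~m.to_numpy(), axis=1, kind='stable')
out = pd.DataFrame(a[np.arange(len(a))[:, None], order], index=df.index)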
Output:
   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN
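For anyone who wants to reuse this, here is a minimal self-contained sketch of the same mask-and-argsort idea. The helper name `non_nan_labels` is hypothetical (not part of pandas or NumPy), and it builds the label array with `np.where` rather than by multiplying the mask by the column names, which is a substitution on my part and not the exact code above:

```python
import numpy as np
import pandas as pd


def non_nan_labels(df: pd.DataFrame) -> pd.DataFrame:
    """For each row, return the column labels of non-NaN cells, packed to the left.

    Hypothetical helper, written only to illustrate the approach.
    """
    mask = df.notna().to_numpy()                 # (n_rows, n_cols) booleans
    cols = df.columns.to_numpy(dtype=object)     # labels as an object array
    # Label where a value is present, NaN elsewhere (cols broadcasts across rows).
    labels = np.where(mask, cols, np.nan)
    # Stable argsort of the inverted mask: present cells (False) sort first
    # and keep their original column order.
    order = np.argsort(~mask, axis=1, kind="stable")
    packed = np.take_along_axis(labels, order, axis=1)
    return pd.DataFrame(packed, index=df.index)


df = pd.DataFrame({'a': [3.0, 2.0, None],
                   'b': [10.0, None, 8.0],
                   'c': [4.0, 2.0, 6.0]})
out = non_nan_labels(df)
print(out)
```

Calling `non_nan_labels(df.T)` should give the row labels of the non-null values per column, which is the transposed variant mentioned in the question.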
Timings:
On 30k rows x 3 columns:
# numpy approach
6.82 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas apply
7.32 s ± 553 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
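If you want to check the comparison on your own machine, a rough sketch with `timeit` could look like the following. The random test data, the 30% missing rate, the seed, and the helper names `make_frame`, `numpy_approach` and `apply_approach` are all assumptions made for illustration. The frame is scaled down to 2,000 rows so it runs quickly; set `n_rows=30_000` to match the size quoted above. Absolute numbers will differ from the ones reported here.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, missing=0.3, seed=0):
    """Random float frame with columns a, b, c and roughly `missing` share of NaN (assumed setup)."""
    rng = np.random.default_rng(seed)
    values = rng.random((n_rows, 3))
    values[rng.random((n_rows, 3)) < missing] = np.nan
    return pd.DataFrame(values, columns=list("abc"))


def numpy_approach(df):
    # Mask-and-argsort approach from the answer.
    m = df.notna()
    a = m.mul(df.columns).where(m).to_numpy()
    order = np.argsort(~m.to_numpy(), axis=1, kind="stable")
    return pd.DataFrame(a[np.arange(len(a))[:, None], order], index=df.index)


def apply_approach(df):
    # Row-wise apply from the question.
    return df.apply(lambda x: pd.Series(x.dropna().index), axis=1)


big = make_frame(n_rows=2_000)
# Average of 3 calls for the fast version, a single call for the slow one.
t_numpy = timeit.timeit(lambda: numpy_approach(big), number=3) / 3
t_apply = timeit.timeit(lambda: apply_approach(big), number=1)
print(f"numpy approach: {t_numpy * 1e3:.2f} ms per call")
print(f"pandas apply:   {t_apply * 1e3:.2f} ms per call")
```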