Which one is preferable? np.where or .loc?

Question

I found two ways to replace some values in a DataFrame based on a condition:

.loc

mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'

np.where()

mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])

Both forms work well, but which is the preferred one? More generally, when should I use .loc and when np.where?
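For reference, here is a minimal sketch of the two forms side by side on a tiny frame. The three-row data and the names df_loc and df_where are made up for illustration; the column is created with dtype=object so that a string replacement fits without any dtype change.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; dtype=object keeps the string replacement valid.
df_loc = pd.DataFrame({"param": pd.Series(["a", None, "b"], dtype=object)})
df_where = df_loc.copy()

# Form 1: assign in place, touching only the masked rows.
mask = df_loc["param"].isnull()
df_loc.loc[mask, "param"] = "new_value"

# Form 2: build a complete new array and replace the whole column with it.
mask = df_where["param"].isnull()
df_where["param"] = np.where(mask, "new_value", df_where["param"])

print(df_loc["param"].tolist())
print(df_where["param"].tolist())
```

On this object column both forms should produce the same values. The practical difference is that .loc writes into the existing column, while np.where returns a new NumPy array whose dtype is chosen by NumPy.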

Asked By: Haritz Laboa

Answer 1

Well, this is not a thorough test, but here's a sample. For each run (.loc, np.where), the data is reset to the original random values using the seed.

toy data 1

Here, there are more np.nan than valid values. Also, the column is of float type.

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})

# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

toy data 2:

Here there are fewer np.nan values than valid values, and the column is of object type:

np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})

Same story:

df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
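A hedged aside on toy data 2: np.random.choice first converts the tuple ("1", np.nan) to a NumPy string array, so the missing value most likely ends up as the literal string 'nan', and isnull() then selects nothing. The sketch below checks this with a smaller sample size n (an assumption, chosen to keep it quick) and shows one possible object-dtype alternative that is not part of the original answer.

```python
import numpy as np
import pandas as pd

n = 1000  # smaller than the answer's 1,000,000, for a quick check

# Construction as written in the answer.
np.random.seed(1)
as_written = pd.DataFrame({"param": np.random.choice(("1", np.nan), n, p=(0.7, 0.3))})
print(as_written["param"].isnull().sum())    # number of real missing values
print((as_written["param"] == "nan").sum())  # number of literal 'nan' strings

# Possible alternative: an object array keeps np.nan as a real float NaN.
np.random.seed(1)
choices = np.array(["1", np.nan], dtype=object)
fixed = pd.DataFrame({"param": np.random.choice(choices, n, p=(0.7, 0.3))})
print(fixed["param"].isnull().sum())
```

If the first count is zero, the mask in the toy data 2 timings is empty, and those numbers measure the cost of a replacement that changes nothing.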

So contrary to @cs95’s comment, loc seems to outperform np.where.
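To repeat the comparison outside Jupyter, something like the following could be used. It is only a sketch: make_df, use_loc, use_where, and bench are hypothetical helper names, the row count is reduced, and the column is forced to object dtype so that writing a string into it does not depend on how a given pandas version handles dtype changes. Each run works on a fresh copy, so every measurement sees the same unmodified data.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n=100_000, seed=1):
    """Build an object column that is roughly 70% NaN (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = rng.choice(np.array([1.0, np.nan], dtype=object), n, p=(0.3, 0.7))
    return pd.DataFrame({"param": pd.Series(values, dtype=object)})


def use_loc(df):
    mask = df["param"].isnull()
    df.loc[mask, "param"] = "new_value"
    return df


def use_where(df):
    mask = df["param"].isnull()
    df["param"] = np.where(mask, "new_value", df["param"])
    return df


base = make_df()


def bench(func, repeats=5):
    """Return the best wall-clock time of func over several fresh copies."""
    times = []
    for _ in range(repeats):
        df = base.copy()  # reset the data outside the timed region
        start = timeit.default_timer()
        func(df)
        times.append(timeit.default_timer() - start)
    return min(times)


print(f"loc:      {bench(use_loc) * 1e3:.2f} ms")
print(f"np.where: {bench(use_where) * 1e3:.2f} ms")
```

The absolute numbers will vary with the machine, the pandas and NumPy versions, the dtype, and the share of rows being replaced, so the ranking is worth re-checking on your own data.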

Answered By: Quang Hoang

Answer 2

The code was run in a Jupyter notebook.

np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})

%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
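# Note: this is chained indexing, so the assignment probably lands on a temporary copy and df1 itself is not modified.
# Question: why does df1.loc[np.where(df1.a.values==2), 'a'] = 8 raise an error?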

%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

I have a question about the third form: why does df1.loc[np.where(df1.a.values==2), 'a'] = 8 raise an error?
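A likely explanation, offered as a sketch and not a definitive diagnosis: when np.where is called with only a condition, it returns a tuple containing one index array per dimension, not the array itself. Passing that tuple as the row indexer of .loc together with a column label is probably what trips the indexer. Taking the first element of the tuple gives plain integer positions, which match the labels here only because df1 has the default RangeIndex. The names hits, rows, and expected are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df1 = pd.DataFrame({"a": np.random.randint(0, 10, 10000)})

hits = np.where(df1["a"].values == 2)
print(type(hits), len(hits))  # a tuple holding a single index array

rows = hits[0]  # the integer positions themselves
expected = int((df1["a"] == 2).sum())

# With the default RangeIndex, positions and labels coincide, so .loc accepts them.
df1.loc[rows, "a"] = 8
print((df1["a"] == 2).sum())  # no 2s should remain
```

np.flatnonzero(df1["a"].values == 2) returns the same positions directly as an array. With a non-default index, .iloc or a boolean mask would be the safer choice.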

Answered By: David Wei
