Which one is preferable? np.where or .loc?
Question:
I found two forms of replacing some values of a data frame based on a condition:
- .loc
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
- np.where()
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
Both forms work well, but which is the preferred one? And in relation to the question, when should I use .loc and when np.where?
Answers:
Well, not a throughout test, but here’s a sample. In each run (loc
, np.where
), the data is reset to the original random with seed.
toy data 1
Here, there are more np.nan
than valid values. Also, the column is of float type.
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})
# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
toy data 2:
Here there are less np.nan
than valid values, and the column is of object type:
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})
same story:
df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So contrary to @cs95’s comment, loc
seems to outperform np.where
.
The code runs in jupyter notebook
np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})
%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# I have a question about this, Why does df1.loc[np.where(df1.a.values==2), 'a']= 8 report an error
%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a question about the third way of writing, Why does df1.loc[np.where(df1.a.values==2), ‘a’]= 8 report an error
I found two forms of replacing some values of a data frame based on a condition:
- .loc
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
- np.where()
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
Both forms work well, but which is the preferred one? And in relation to the question, when should I use .loc and when np.where?
Well, not a throughout test, but here’s a sample. In each run (loc
, np.where
), the data is reset to the original random with seed.
toy data 1
Here, there are more np.nan
than valid values. Also, the column is of float type.
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})
# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
toy data 2:
Here there are less np.nan
than valid values, and the column is of object type:
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})
same story:
df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So contrary to @cs95’s comment, loc
seems to outperform np.where
.
The code runs in jupyter notebook
np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})
%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# I have a question about this, Why does df1.loc[np.where(df1.a.values==2), 'a']= 8 report an error
%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a question about the third way of writing, Why does df1.loc[np.where(df1.a.values==2), ‘a’]= 8 report an error