How to set all the values of an existing Pandas DataFrame to zero?
Question:
I currently have an existing Pandas DataFrame with a date index, and columns each with a specific name.
As for the data cells, they are filled with various float values.
I would like to copy my DataFrame, but replace all these values with zero.
The objective is to reuse the structure of the DataFrame (dimensions, index, column names), but clear all the current values by replacing them with zeroes.
The way I’m currently achieving this is as follows:
df[df > 0] = 0
However, this would not replace any negative value in the DataFrame.
Isn’t there a more general approach to filling an entire existing DataFrame with a single common value?
Thank you in advance for your help.
Answers:
The absolute fastest way, which also preserves dtypes, is the following:
for col in df.columns:
    df[col].values[:] = 0
This directly writes to the underlying numpy array of each column. I doubt any other method will be faster than this, as it allocates no additional storage and doesn’t pass through pandas’s dtype handling. You can also use np.issubdtype to only zero out numeric columns. This is probably what you want if you have a mixed-dtype DataFrame, but of course it’s not necessary if your DataFrame is already entirely numeric.
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you’re not sure whether your DataFrame is entirely numeric, you should probably include the issubdtype check.
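Putting the loop and the issubdtype check together, a small helper along these lines zeroes only the numeric columns in place (the name zero_numeric is mine, not from the answer, and the sketch assumes pandas’s default behavior, where df[col].values exposes a writable view of the column’s data):

```python
import numpy as np
import pandas as pd

def zero_numeric(df):
    # Write zeros directly into the underlying numpy array of each
    # numeric column; non-numeric columns are left untouched.
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0
    return df

df = pd.DataFrame({"i": [1, 2, 3], "f": [0.5, 1.5, 2.5], "s": list("abc")})
zero_numeric(df)
```

Because the write goes through .values, each column keeps its original dtype: the int column stays int and the float column stays float.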
Timing comparisons
Setup
import pandas as pd
import numpy as np
def make_df(n, only_numeric):
    series = [
        pd.Series(range(n), name="int", dtype=int),
        pd.Series(range(n), name="float", dtype=float),
    ]
    if only_numeric:
        series.extend(
            [
                pd.Series(range(n, 2 * n), name="int2", dtype=int),
                pd.Series(range(n, 2 * n), name="float2", dtype=float),
            ]
        )
    else:
        series.extend(
            [
                pd.date_range(start="1970-1-1", freq="T", periods=n, name="dt")
                .to_series()
                .reset_index(drop=True),
                pd.Series(
                    [chr((i % 26) + 65) for i in range(n)],
                    name="string",
                    dtype="object",
                ),
            ]
        )
    return pd.concat(series, axis=1)
>>> make_df(5, True)
int float int2 float2
0 0 0.0 5 5.0
1 1 1.0 6 6.0
2 2 2.0 7 7.0
3 3 3.0 8 8.0
4 4 4.0 9 9.0
>>> make_df(5, False)
int float dt string
0 0 0.0 1970-01-01 00:00:00 A
1 1 1.0 1970-01-01 00:01:00 B
2 2 2.0 1970-01-01 00:02:00 C
3 3 3.0 1970-01-01 00:03:00 D
4 4 4.0 1970-01-01 00:04:00 E
Small DataFrame
n = 10_000
# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
36.1 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
53 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
113 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.4 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large DataFrame
n = 10_000_000
# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
38.7 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.1 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
99.5 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
17.8 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I’d previously suggested the answer below, but I now consider it harmful — it’s significantly slower than the above answers and is harder to reason about. Its only advantage is being nicer to write.
The cleanest way is to use a bare colon to reference the entire DataFrame:
df[:] = 0
Unfortunately the dtype situation is a bit fuzzy, because every column in the resulting DataFrame will have the same dtype. If every column of df was originally float, the new dtypes will still be float. But if a single column was int or object, it seems that the new dtypes will all be int.
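To see what df[:] = 0 does to a mixed-dtype frame on your own pandas version, a quick check along these lines helps (the resulting dtypes vary across pandas versions, so this sketch inspects them rather than assuming a particular outcome):

```python
import pandas as pd

df = pd.DataFrame({"i": [1, 2, 3], "f": [0.5, 1.5, 2.5]})
before = df.dtypes.copy()
df[:] = 0
# All values are now zero; whether each column kept its original dtype
# depends on the pandas version, so compare df.dtypes against `before`.
print(before)
print(df.dtypes)
```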
You can use the replace function:
df2 = df.replace(df, 0)
Since you are trying to make a copy, it might be better to simply create a new data frame with values as 0, and columns and index from the original data frame:
pd.DataFrame(0, columns=df.columns, index=df.index)
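A sketch of that approach; one point worth noting is that with a scalar 0 every column of the new frame defaults to an integer dtype, so pass a float fill value if you want float columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [3.0, 4.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# Fresh frame of zeros reusing df's index and column names.
zeros = pd.DataFrame(0, columns=df.columns, index=df.index)

# To get float columns, use a float fill value instead:
zeros_float = pd.DataFrame(0.0, columns=df.columns, index=df.index)
```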
FYI the accepted answer from BallpointBen was almost 2 orders of magnitude faster for me than the .replace() operation offered by Joe T Boka. Both are helpful. Thanks!
To be clear, the fast way described by BallpointBen is:
for col in df.columns:
df[col].values[:] = 0
*I would have commented this but I don’t have enough street cred/reputation yet since I have been lurking for years. I used timeit.timeit() for the comparison.
A simple example:
def zeros_like(df):
    new_df = df.copy()
    for col in new_df.columns:
        new_df[col].values[:] = 0
    return new_df
Late to post, but I just wanted to share an alternate way without using any loops:
df.iloc[:] = 0
This can be achieved by multiplying the DataFrame by 0:
df = df * 0
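This works for numeric columns, but two caveats are worth noting: multiplication propagates NaN (NaN * 0 is still NaN, not 0), and an object column of strings becomes empty strings rather than zeros. A quick illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan], "y": [2, 3], "s": ["ab", "cd"]})
zeroed = df * 0
# zeroed["x"] is [0.0, NaN], zeroed["y"] is [0, 0], zeroed["s"] is ["", ""]
```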