handling zeros in pandas DataFrames column divisions in Python
Question:
What’s the best way to handle zero denominators when dividing pandas DataFrame columns by each other in Python? for example:
df = pandas.DataFrame({"a": [1, 2, 0, 1, 5], "b": [0, 10, 20, 30, 50]})
df.a / df.b # yields error
I’d like the ratios where the denominator is zero to be registered as NA (numpy.nan
). How can this be done efficiently in pandas?
Casting to float64
does not work at level of columns:
In [29]: df
Out[29]:
a b
0 1 0
1 2 10
2 0 20
3 1 30
4 5 50
In [30]: df["a"].astype("float64") / df["b"].astype("float64")
...
FloatingPointError: divide by zero encountered in divide
How can I do it just for particular columns and not entire df?
Answers:
You need to work in floats, otherwise you will have integer division, prob not what you want
In [12]: df = pandas.DataFrame({"a": [1, 2, 0, 1, 5],
"b": [0, 10, 20, 30, 50]}).astype('float64')
In [13]: df
Out[13]:
a b
0 1 0
1 2 10
2 0 20
3 1 30
4 5 50
In [14]: df.dtypes
Out[14]:
a float64
b float64
dtype: object
Here’s one way
In [15]: x = df.a/df.b
In [16]: x
Out[16]:
0 inf
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
In [17]: x[np.isinf(x)] = np.nan
In [18]: x
Out[18]:
0 NaN
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
Here’s another way
In [20]: df.a/df.b.replace({ 0 : np.nan })
Out[20]:
0 NaN
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
Just for completeness, I would like to add the following way of division that uses DataFrame.apply like:
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
In full:
In [1]:
df = pd.DataFrame({"a": [1, 2, 0, 1, 5, 0], "b": [0, 10, 20, 30, 50, 0]}).astype('float64')
def div(numerator, denominator):
return lambda row: 0.0 if row[denominator] == 0 else float(row[numerator]/row[denominator])
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
Out[1]:
a b c
0 1.0 0.0 0.000000
1 2.0 10.0 0.200000
2 0.0 20.0 0.000000
3 1.0 30.0 0.033333
4 5.0 50.0 0.100000
5 0.0 0.0 0.000000
This solution is slower than the one proposed by Jeff:
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
# 1.27 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.loc[:, 'c'] = df.a/df.b.replace({ 0 : np.inf })
# 651 µs ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To spell out the different behavior of pandas for columns of different dtypes, note that pandas supports division by zero for columns with numeric dtype (such as float and int64) by returning a result of inf, but for columns of object type, it raises a ZeroDivisionError exception.
See my answer to a related question for examples.
What’s the best way to handle zero denominators when dividing pandas DataFrame columns by each other in Python? for example:
df = pandas.DataFrame({"a": [1, 2, 0, 1, 5], "b": [0, 10, 20, 30, 50]})
df.a / df.b # yields error
I’d like the ratios where the denominator is zero to be registered as NA (numpy.nan
). How can this be done efficiently in pandas?
Casting to float64
does not work at level of columns:
In [29]: df
Out[29]:
a b
0 1 0
1 2 10
2 0 20
3 1 30
4 5 50
In [30]: df["a"].astype("float64") / df["b"].astype("float64")
...
FloatingPointError: divide by zero encountered in divide
How can I do it just for particular columns and not entire df?
You need to work in floats, otherwise you will have integer division, prob not what you want
In [12]: df = pandas.DataFrame({"a": [1, 2, 0, 1, 5],
"b": [0, 10, 20, 30, 50]}).astype('float64')
In [13]: df
Out[13]:
a b
0 1 0
1 2 10
2 0 20
3 1 30
4 5 50
In [14]: df.dtypes
Out[14]:
a float64
b float64
dtype: object
Here’s one way
In [15]: x = df.a/df.b
In [16]: x
Out[16]:
0 inf
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
In [17]: x[np.isinf(x)] = np.nan
In [18]: x
Out[18]:
0 NaN
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
Here’s another way
In [20]: df.a/df.b.replace({ 0 : np.nan })
Out[20]:
0 NaN
1 0.200000
2 0.000000
3 0.033333
4 0.100000
dtype: float64
Just for completeness, I would like to add the following way of division that uses DataFrame.apply like:
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
In full:
In [1]:
df = pd.DataFrame({"a": [1, 2, 0, 1, 5, 0], "b": [0, 10, 20, 30, 50, 0]}).astype('float64')
def div(numerator, denominator):
return lambda row: 0.0 if row[denominator] == 0 else float(row[numerator]/row[denominator])
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
Out[1]:
a b c
0 1.0 0.0 0.000000
1 2.0 10.0 0.200000
2 0.0 20.0 0.000000
3 1.0 30.0 0.033333
4 5.0 50.0 0.100000
5 0.0 0.0 0.000000
This solution is slower than the one proposed by Jeff:
df.loc[:, 'c'] = df.apply(div('a', 'b'), axis=1)
# 1.27 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.loc[:, 'c'] = df.a/df.b.replace({ 0 : np.inf })
# 651 µs ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To spell out the different behavior of pandas for columns of different dtypes, note that pandas supports division by zero for columns with numeric dtype (such as float and int64) by returning a result of inf, but for columns of object type, it raises a ZeroDivisionError exception.
See my answer to a related question for examples.