Set values on the diagonal of pandas.DataFrame
Question:
I have a pandas DataFrame and would like to set its diagonal to 0:
import numpy
import pandas
df = pandas.DataFrame(numpy.random.rand(5,5))
df
Out[6]:
0 1 2 3 4
0 0.536596 0.674319 0.032815 0.908086 0.215334
1 0.735022 0.954506 0.889162 0.711610 0.415118
2 0.119985 0.979056 0.901891 0.687829 0.947549
3 0.186921 0.899178 0.296294 0.521104 0.638924
4 0.354053 0.060022 0.275224 0.635054 0.075738
5 rows × 5 columns
Now I want to set the diagonal to 0:
for i in range(len(df.index)):
    for j in range(len(df.columns)):
        if i==j:
            df.loc[i,j] = 0
df
Out[9]:
0 1 2 3 4
0 0.000000 0.674319 0.032815 0.908086 0.215334
1 0.735022 0.000000 0.889162 0.711610 0.415118
2 0.119985 0.979056 0.000000 0.687829 0.947549
3 0.186921 0.899178 0.296294 0.000000 0.638924
4 0.354053 0.060022 0.275224 0.635054 0.000000
5 rows × 5 columns
But there must be a more Pythonic way than that?
Answers:
In [21]: df.values[tuple([np.arange(df.shape[0])]*2)] = 0
In [22]: df
Out[22]:
0 1 2 3 4
0 0.000000 0.931374 0.604412 0.863842 0.280339
1 0.531528 0.000000 0.641094 0.204686 0.997020
2 0.137725 0.037867 0.000000 0.983432 0.458053
3 0.594542 0.943542 0.826738 0.000000 0.753240
4 0.357736 0.689262 0.014773 0.446046 0.000000
Note that this will only work if df has the same number of rows as columns. Another way, which works for arbitrary shapes, is to use np.fill_diagonal:
In [36]: np.fill_diagonal(df.values, 0)
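To make the "arbitrary shapes" point concrete, here is a small sketch on plain NumPy arrays rather than on a DataFrame. The array names are illustrative and not part of the answer above; the sketch relies on the default wrap=False.

```python
import numpy as np

# Illustrative arrays (not from the original answer): one wide, one tall.
wide = np.ones((3, 5))
tall = np.ones((5, 3))

# fill_diagonal modifies its argument in place and returns None.
np.fill_diagonal(wide, 0)
np.fill_diagonal(tall, 0)

# With the default wrap=False, only the cells (i, i) for
# i < min(rows, cols) are overwritten.
print(wide)
print(tall)
```

In both cases only three cells change, so the call is safe on non-square data.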
Both approaches in unutbu’s answer assume that labels are irrelevant (they operate on the underlying values).
The OP's code works with .loc and so is label-based instead (i.e. it puts a 0 in cells whose row and column labels match, rather than in cells located on the diagonal; admittedly, this is irrelevant in the specific example given, in which labels are just positions).
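The distinction is easier to see on a frame whose columns are ordered differently from its rows. The sketch below only lists which cells each interpretation would touch; the frame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose column order differs from its row order.
df = pd.DataFrame(np.ones((3, 3)), index=list("abc"), columns=list("cab"))

# Positional diagonal: cells (0, 0), (1, 1), (2, 2) of the underlying array,
# reported here by their (row label, column label) pairs.
positional = [(df.index[i], df.columns[i]) for i in range(3)]

# Label-based diagonal: cells whose row label equals the column label.
label_based = [(lab, lab) for lab in df.index.intersection(df.columns)]

print(positional)
print(label_based)
```

The two lists have no cell in common here, which is why the choice matters once labels stop being plain positions.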
Being in need of “label-based” diagonal filling (working with a DataFrame describing an incomplete adjacency matrix), the simplest approach I could come up with was:
def pd_fill_diagonal(df, value):
    # Labels that appear both as a row label and as a column label.
    idces = df.index.intersection(df.columns)
    stacked = df.stack(dropna=False)
    stacked.update(pd.Series(value,
                             index=pd.MultiIndex.from_arrays([idces,
                                                              idces])))
    df.loc[:, :] = stacked.unstack()
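For comparison, a plainer way to get the same label-based effect is to loop over the shared labels and assign with .at. This is only a sketch, not the answer's method: the helper name fill_label_diagonal is hypothetical, and it assumes row and column labels are unique.

```python
import numpy as np
import pandas as pd

def fill_label_diagonal(df, value):
    """Set df.at[label, label] = value for labels in both index and columns."""
    # Hypothetical helper name; assumes row and column labels are unique.
    for label in df.index.intersection(df.columns):
        df.at[label, label] = value
    return df

# Incomplete adjacency matrix: rows and columns only partly overlap.
adj = pd.DataFrame(np.ones((3, 3)), index=["a", "b", "c"], columns=["b", "c", "d"])
fill_label_diagonal(adj, 0.0)
print(adj)
```

The loop is not vectorized, so it trades speed for simplicity; it touches only the cells ("b", "b") and ("c", "c") in this example.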
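Here is a hack that worked for me:
def set_diag(self, values):
    n = min(len(self.index), len(self.columns))
    # Index with a tuple of arrays; recent NumPy versions no longer treat
    # a list of arrays as a multidimensional index.
    self.values[tuple([np.arange(n)] * 2)] = values

pd.DataFrame.set_diag = set_diag
x = pd.DataFrame(np.random.randn(10, 5))
x.set_diag(0)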
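This solution is vectorized and very fast and, unlike the other suggested solutions, works for any column names and any size of DataFrame:
def pd_fill_diagonal(df_matrix, value=0):
    mat = df_matrix.values
    # Use the shorter axis so non-square frames do not raise an IndexError.
    n = min(mat.shape)
    mat[range(n), range(n)] = value
    # Keep the original row and column labels on the result.
    return pd.DataFrame(mat, index=df_matrix.index, columns=df_matrix.columns)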
Performance on a DataFrame with 507 rows and 507 columns:
%timeit pd_fill_diagonal(df, 0)
1000 loops, best of 3: 145 µs per loop
Using np.fill_diagonal(df.values, 1) is the easiest, but you need to make sure all your columns have the same data type. I had a mixture of np.float64 and Python floats, and it would only affect the NumPy values. To fix this, you have to cast everything to a NumPy type.
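A related failure is easy to reproduce with any frame that holds more than one dtype. The sketch below uses an integer column next to float columns, which is an assumption chosen for illustration and not the exact mix described above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing an integer column with float columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0], "c": [4.0, 5.0, 6.0]})
before = df.copy()

# With more than one dtype, .values has to build a new float64 array,
# so the fill below changes that temporary array and not df itself.
np.fill_diagonal(df.values, 0)

# Compare the frame with the snapshot taken before the call.
print(df.equals(before))
```

No error is raised, which makes this kind of silent no-op easy to miss.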
All the answers given which rely on modifying DataFrame.values depend on undocumented behavior. The values property is allowed to return a copy of the data, but the solutions that modify values assume it returns a view. Sometimes it does return a view, but the pandas documentation makes no guarantees about when it will.
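One way to sidestep the view-versus-copy question is to work on an explicit copy and build a new DataFrame from it. The following is a minimal sketch under that assumption; the helper name with_filled_diagonal is hypothetical and it fills the positional diagonal, not the label-based one.

```python
import numpy as np
import pandas as pd

def with_filled_diagonal(df, value=0):
    """Return a new DataFrame with the positional diagonal set to value."""
    # Hypothetical helper; copy=True guarantees a private, writable array.
    arr = df.to_numpy(copy=True)
    np.fill_diagonal(arr, value)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df = pd.DataFrame(np.random.rand(4, 4), index=list("wxyz"), columns=list("wxyz"))
result = with_filled_diagonal(df)
print(result)
```

The original frame is left untouched. For frames with mixed column dtypes, the result takes the common dtype that to_numpy picks, so per-column dtypes may not be preserved.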
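Another way to accomplish this is to build the anti-identity matrix (ones everywhere, zeros on the diagonal) and multiply your DataFrame by it. This returns a new DataFrame and only works for square frames: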
df * abs(np.eye(len(df))-1)
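One caveat of the multiplication trick is worth checking: a missing value on the diagonal stays missing, because NaN times 0 is still NaN. The sketch below uses a made-up 3x3 frame to show this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 3)))
df.iloc[1, 1] = np.nan  # a missing value on the diagonal

# Zeros on the diagonal, ones elsewhere (same matrix as abs(np.eye(n) - 1)).
mask = 1 - np.eye(len(df))
product = df * mask

print(product)
```

If the data may contain NaN or infinite values on the diagonal, an approach that assigns the value directly is the safer choice.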
Here is a way with np.identity:
df.where(np.identity(df.shape[0]) != 1, 0)
Output:
0 1 2 3 4
0 0.000000 0.674319 0.032815 0.908086 0.215334
1 0.735022 0.000000 0.889162 0.711610 0.415118
2 0.119985 0.979056 0.000000 0.687829 0.947549
3 0.186921 0.899178 0.296294 0.000000 0.638924
4 0.354053 0.060022 0.275224 0.635054 0.000000
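The same masking idea can be extended to non-square frames by building a rectangular boolean identity. This is a sketch under that assumption and not part of the answer above; it uses DataFrame.mask, which replaces values where the condition is True.

```python
import numpy as np
import pandas as pd

# Illustrative non-square frame.
df = pd.DataFrame(np.ones((3, 5)))

# np.eye(N, M) builds a rectangular identity; dtype=bool makes it a mask.
on_diagonal = np.eye(*df.shape, dtype=bool)

# mask returns a new frame and leaves df unchanged.
result = df.mask(on_diagonal, 0)
print(result)
```

Because mask returns a new object, this variant does not depend on whether the underlying array is a view or a copy.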