Pandas dataframe, change column order using reindex does give expected result in for loop
Question:
I have two dataframes that look the same and for both of them I want to add an additional column and then reorder the columns. Here is a sample of what I tried to accomplish this:
data=[[1,2],[3,4]]
cols=['col1','col2']
df1=pd.DataFrame(data,columns=cols)
df2=pd.DataFrame(data,columns=cols)
for df in [df1,df2]:
df.loc[:,'col3']=[5,6]
df=df.reindex(['col3','col2','col1'],axis=1)
print(df1)
col1 col2 col3
0 1 2 5
1 3 4 6
print(df2)
col1 col2 col3
0 1 2 5
1 3 4 6
The third column was added as expected but the columns are still in the original order. I expected them to be col3, col2, col1. When I tried this later on the reindex worked as expected:
df1=df1.reindex(['col3','col2','col1'],axis=1)
I’m sure there is an explanation to why the column gets added but the reindex is ignored in my first attempt, but I have not been able to find one. Does anyone know why this happens?
Answers:
This is because df in your for loop is a local variable. When you do df.loc[:,'col3']=[5,6]
, you do a modification to the thing df
references, which therefore affects df1
. However, doing
df.reindex(['col3','col2','col1'],axis=1)
does not modify the original DataFrame but creates a new copy of it, which is then assigned to the local variable df
inside the for loop. However, df1
and df2
remain unchanged. To see this, you can try printing df
at the end of the for loop. It should print the desired value you want for df2
(with the reindexing)
This is due to the way assignments work in Python. When you called df.loc[:,'col3']=[5,6]
, Python (correctly) interpreted this as "change loc[:,'col3']
of the object called df
to [5,6]
". On the other hand, when you called df = df.reindex(['col3', 'col2', 'col1'], axis=1)
, you were expecting this to be interpreted as "replace the object called df
with the reindexed dataframe". What it actually did was redefine the label df
so that it refers to the reindexed dataframe (without changing the object that df
used to refer to).
I have two dataframes that look the same and for both of them I want to add an additional column and then reorder the columns. Here is a sample of what I tried to accomplish this:
data=[[1,2],[3,4]]
cols=['col1','col2']
df1=pd.DataFrame(data,columns=cols)
df2=pd.DataFrame(data,columns=cols)
for df in [df1,df2]:
df.loc[:,'col3']=[5,6]
df=df.reindex(['col3','col2','col1'],axis=1)
print(df1)
col1 col2 col3
0 1 2 5
1 3 4 6
print(df2)
col1 col2 col3
0 1 2 5
1 3 4 6
The third column was added as expected but the columns are still in the original order. I expected them to be col3, col2, col1. When I tried this later on the reindex worked as expected:
df1=df1.reindex(['col3','col2','col1'],axis=1)
I’m sure there is an explanation to why the column gets added but the reindex is ignored in my first attempt, but I have not been able to find one. Does anyone know why this happens?
This is because df in your for loop is a local variable. When you do df.loc[:,'col3']=[5,6]
, you do a modification to the thing df
references, which therefore affects df1
. However, doing
df.reindex(['col3','col2','col1'],axis=1)
does not modify the original DataFrame but creates a new copy of it, which is then assigned to the local variable df
inside the for loop. However, df1
and df2
remain unchanged. To see this, you can try printing df
at the end of the for loop. It should print the desired value you want for df2
(with the reindexing)
This is due to the way assignments work in Python. When you called df.loc[:,'col3']=[5,6]
, Python (correctly) interpreted this as "change loc[:,'col3']
of the object called df
to [5,6]
". On the other hand, when you called df = df.reindex(['col3', 'col2', 'col1'], axis=1)
, you were expecting this to be interpreted as "replace the object called df
with the reindexed dataframe". What it actually did was redefine the label df
so that it refers to the reindexed dataframe (without changing the object that df
used to refer to).