Pandas dataframe, change column order using reindex does not give expected result in for loop

Question:

I have two dataframes that look the same and for both of them I want to add an additional column and then reorder the columns. Here is a sample of what I tried to accomplish this:

import pandas as pd

data=[[1,2],[3,4]]
cols=['col1','col2']

df1=pd.DataFrame(data,columns=cols)
df2=pd.DataFrame(data,columns=cols)

for df in [df1,df2]:
    df.loc[:,'col3']=[5,6]
    df=df.reindex(['col3','col2','col1'],axis=1)

print(df1)

      col1  col2  col3
    0     1     2     5
    1     3     4     6

print(df2)

      col1  col2  col3
    0     1     2     5
    1     3     4     6

The third column was added as expected but the columns are still in the original order. I expected them to be col3, col2, col1. When I tried this later on the reindex worked as expected:

df1=df1.reindex(['col3','col2','col1'],axis=1)

I'm sure there is an explanation for why the column gets added but the reindex is ignored in my first attempt, but I have not been able to find one. Does anyone know why this happens?

Asked By: joe76


Answers:

This is because df in your for loop is just a name that is bound, in turn, to each DataFrame in the list. When you do df.loc[:,'col3']=[5,6], you modify the object that df refers to, so the change shows up in df1 (and later in df2). However, df.reindex(['col3','col2','col1'],axis=1) does not modify the original DataFrame; it returns a new one, and the assignment only rebinds the loop variable df to that new object. df1 and df2 remain unchanged. To see this, try printing df at the end of the for loop. It should print the result you wanted for df2 (with the reindexing).
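The difference between modifying an object and rebinding a name can be checked with an identity test. The following is a minimal sketch reusing the data from the question; the names same_before and same_after are introduced here purely for illustration.

```python
import pandas as pd

data = [[1, 2], [3, 4]]
cols = ['col1', 'col2']
df1 = pd.DataFrame(data, columns=cols)

# Same situation as the first loop iteration: df and df1 name one object.
df = df1
df.loc[:, 'col3'] = [5, 6]   # modifies that object, so df1 sees col3 too
same_before = df is df1      # still the same object

# reindex returns a new DataFrame; the assignment rebinds df only.
df = df.reindex(['col3', 'col2', 'col1'], axis=1)
same_after = df is df1       # df now names a different object

print(same_before, same_after)
print(list(df1.columns))     # original order, with col3 appended
print(list(df.columns))      # reordered copy
```

After the rebinding, nothing connects df back to df1, which is why the reordering never reaches df1 inside the loop.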

Answered By: Shreyas Balaji

This is due to the way assignments work in Python. When you called df.loc[:,'col3']=[5,6], Python (correctly) interpreted this as "change loc[:,'col3'] of the object called df to [5,6]". On the other hand, when you called df = df.reindex(['col3', 'col2', 'col1'], axis=1), you were expecting this to be interpreted as "replace the object called df with the reindexed dataframe". What it actually did was redefine the label df so that it refers to the reindexed dataframe (without changing the object that df used to refer to).
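One possible way around this, sketched here under the assumption that it is acceptable to rebind df1 and df2 themselves, is to build the new DataFrames in a helper and assign the results back to the original names. The helper add_and_reorder is a hypothetical name, not something from the question or from pandas.

```python
import pandas as pd

data = [[1, 2], [3, 4]]
cols = ['col1', 'col2']
df1 = pd.DataFrame(data, columns=cols)
df2 = pd.DataFrame(data, columns=cols)

def add_and_reorder(df):
    # Hypothetical helper: returns a new DataFrame and leaves the input as is.
    out = df.assign(col3=[5, 6])
    return out.reindex(['col3', 'col2', 'col1'], axis=1)

# Rebind the outer names to the returned objects.
df1, df2 = [add_and_reorder(df) for df in (df1, df2)]

print(df1)
print(df2)
```

Because the assignment targets df1 and df2 directly, the reordered frames are what those names refer to afterwards, rather than being lost with the loop variable.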

Answered By: Ben Grossmann