Shallow copy in Pandas

Question:

Pandas version 1.5.3

Problem: a shallow copy should reflect values assigned to the original df, but that does not happen in this example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1],
    'B': [2, 2, 2]
})
df2 = df.copy(deep=False)  # shallow copy
df['A'] = [10, 10, 10]

Result:

df2:
A B
1 2
1 2
1 2

Expected result:

df2:
A B
10 2
10 2
10 2

By the way, when I do this instead:

df['A'] += [10,10,10] # df2 follows df
df2:
A B
11 2
11 2
11 2

It also works when I change a specific value in the series:

df['A'][0] = 10 
df2:
A  B
10 2
1  2
1  2

Could you explain why df['A'] = [10,10,10] does not update the shallow copy of this df, while the other examples do?

Asked By: sygneto


Answers:

A shallow copy creates a new object that shares its underlying data with the original. Changes made to that shared data through either object are visible in both.

In the example you provided, the line df2 = df.copy(False) creates a shallow copy of df. This means that df2 shares its data with df, so changes written into that shared data through df also show up in df2.

However, the line df['A'] = [10,10,10] creates a new object for the column 'A' in df. This means that df['A'] now refers to a completely new object that is separate from the original data that df2 is referencing. As a result, df2 is not affected by this change because it is still referencing the original data.
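
The rebinding described above can be checked directly. The following is a minimal sketch, assuming NumPy is importable alongside pandas. It uses np.shares_memory to ask whether column A of the two frames is backed by the same buffer before and after the assignment. The names shared_before and shared_after are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [2, 2, 2]})
df2 = df.copy(deep=False)  # shallow copy

# Does column A of both frames read from the same buffer right after the copy?
shared_before = np.shares_memory(df["A"].to_numpy(), df2["A"].to_numpy())

# Column assignment stores a new array in df; df2 keeps the old one.
df["A"] = [10, 10, 10]
shared_after = np.shares_memory(df["A"].to_numpy(), df2["A"].to_numpy())

print(shared_before, shared_after)
print(df2["A"].tolist())  # [1, 1, 1], as reported in the question
```

Once the column has been replaced, the two frames no longer share that buffer, so nothing done to the new column in df can reach df2.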

On the other hand, when you use the += operator, you are modifying the original object that df['A'] references rather than creating a new object. Since df2 and df both reference the same object, any changes made to the object will be reflected in both dataframes.

Similarly, when you change a specific value in the series, you are modifying the original object that df['A'] references, which is shared by both dataframes. This is why the change is reflected in df2.
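
The difference between modifying an object and replacing it is easiest to see on a plain NumPy array. This is only an analogy, not a description of pandas internals: alias is a hypothetical second name standing in for the buffer a shallow copy would share.

```python
import numpy as np

data = np.array([1, 1, 1])
alias = data  # second name for the same array, like a shared column buffer

data += 10  # in place: writes into the existing buffer
print(alias.tolist())  # [11, 11, 11]

data[0] = 99  # element assignment also writes into the same buffer
print(alias.tolist())  # [99, 11, 11]

data = np.array([10, 10, 10])  # rebinding: `data` now names a brand-new array
print(alias.tolist())  # [99, 11, 11], the old buffer is untouched
```

The first two statements change what both names see. The last one only changes which array the name data points to.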

To get the expected result where the changes to df are also reflected in df2, you can modify df in a way that writes into the original object that df2 references. One way to do this is to use the loc accessor to set the values in place. Whether loc writes in place has varied between pandas versions, so check it on the version you run:
df.loc[:, 'A'] = [10, 10, 10]

Answered By: utkarshx27

There is an explanation for this.
First, let's look at what df2 = df.copy(False) means. The pandas documentation for DataFrame.copy says:

  • When deep=False, a new object will be created without copying the
    calling object’s data or index (only references to the data and index
    are copied). Any changes to the data of the original will be
    reflected in the shallow copy (and vice versa).

This means that if the values in df or df2 are changed, the change affects both DataFrames.
The effect is visible with df['A'] += [10,10,10] or df['A'] += 10, although it is hard to see what happens when every value in the column is the same.

Let's try this:

df = pd.DataFrame({
    'A': ["a", "b", "c"],
    'B': [2, 2, 2]
})
df2 = df.copy(False)
df['A'] += "10"

After df['A'] += "10", both DataFrames show:

     A  B
0  a10  2
1  b10  2
2  c10  2

But if df['A'] = ["c","d","e"] is used, you will get different outputs for df and df2. This is because you have not modified the values ["a","b","c"]; you have replaced the data, and with it the reference. See the description at the beginning again.

Now let's come to df.loc[:,'A'] = ["c","d","e"]. How come this changes the values for both DataFrames? With df.loc[:,'A'] we select the existing data through its reference and overwrite its values with the new ones. This is different from replacing the entire column ['A'] with new data that has no reference to the old data.

Learning these differences thoroughly takes some experience, since it is not always obvious why two seemingly identical commands behave differently.
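
One way to build that experience is to test each operation on the pandas version you have installed. The sketch below is a hedged illustration: the helper change_reaches_shallow_copy and the three small mutation functions are hypothetical names introduced here, not part of pandas. No expected output is given for the loc and += cases because the result depends on the version.

```python
import pandas as pd


def change_reaches_shallow_copy(mutate):
    """Apply `mutate` to a fresh frame and report whether its shallow copy changed."""
    df = pd.DataFrame({"A": [1, 1, 1], "B": [2, 2, 2]})
    df2 = df.copy(deep=False)
    before = df2["A"].tolist()
    mutate(df)
    return df2["A"].tolist() != before


def assign_column(df):
    df["A"] = [10, 10, 10]  # replaces the column object


def assign_with_loc(df):
    df.loc[:, "A"] = [10, 10, 10]  # may or may not write in place


def add_in_place(df):
    df["A"] += 10  # augmented assignment on the column


print("pandas", pd.__version__)
for mutate in (assign_column, assign_with_loc, add_in_place):
    print(mutate.__name__, change_reaches_shallow_copy(mutate))
```

Plain column assignment should report False, matching the question. The other two lines are the ones to watch: with Copy-on-Write enabled (opt-in in pandas 2.x and the default from pandas 3.0), a shallow copy is not expected to see modifications made to the original at all.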

Answered By: tetris programming