In Pandas, does .iloc method give a copy or view?
Question:
The result seems a little random: sometimes it's a copy, sometimes it's a view. For example:
import pandas as pd
df = pd.DataFrame([{'name': 'Marry', 'age': 21}, {'name': 'John', 'age': 24}], index=['student1', 'student2'])
df
age name
student1 21 Marry
student2 24 John
Now, let me try to modify it a little bit.
df2 = df.loc['student1']
df2[0] = 23
df
age name
student1 21 Marry
student2 24 John
As you can see, nothing changed, so df2 is a copy. However, if I add another student to the DataFrame:
df.loc['student3'] = ['old','Tom']
df
age name
student1 21 Marry
student2 24 John
student3 old Tom
Now try to change the age again:
df3 = df.loc['student1']
df3[0] = 33
df
age name
student1 33 Marry
student2 24 John
student3 old Tom
Now df3 suddenly became a view. What is going on? I guess the value ‘old’ is the key?
Answers:
You are starting with a DataFrame that has two columns with two different dtypes:
df.dtypes
Out:
age int64
name object
dtype: object
Since different dtypes are stored in different numpy arrays under the hood, you have two different blocks for them:
df.blocks
Out:
{'int64': age
student1 21
student2 24, 'object': name
student1 Marry
student2 John}
If you attempt to slice the first row of this DataFrame, pandas has to take one value from each block, which makes it necessary to create a copy.
df2.is_copy
Out[40]: <weakref at 0x7fc4487a9228; to 'DataFrame' at 0x7fc4488f9dd8>
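The block argument above can be sketched with plain NumPy. This is only an analogy under stated assumptions, not pandas internals: the names single_block, ages, names, row_view and row_copy are made up for illustration. A row taken from one homogeneous 2-D array can be a view, while a row assembled from two separate arrays has to be a new array.

```python
import numpy as np

# One homogeneous 2-D array: analogous to a frame stored in a single block.
single_block = np.array([[21, "Marry"], [24, "John"]], dtype=object)
row_view = single_block[0]      # basic indexing returns a view on the same buffer
row_view[0] = 33
print(single_block[0, 0])       # 33: the parent array sees the write

# Two separate arrays: analogous to an int64 block and an object block.
ages = np.array([21, 24], dtype=np.int64)
names = np.array(["Marry", "John"], dtype=object)
row_copy = np.array([ages[0], names[0]], dtype=object)  # a new array must be built
row_copy[0] = 33
print(ages[0])                  # 21: the original data is untouched
```

np.shares_memory(row_view, single_block) is True here, while np.shares_memory(row_copy, ages) is False, which is the same distinction the two DataFrame cases show.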
In the second attempt, you are changing the dtypes. Since 'old' cannot be stored in an integer array, pandas casts the age column to object dtype.
df.loc['student3'] = ['old','Tom']
df.dtypes
Out:
age object
name object
dtype: object
Now all data for this DataFrame is stored in a single block (and in a single numpy array):
df.blocks
Out:
{'object': age name
student1 21 Marry
student2 24 John
student3 old Tom}
At this step, slicing the first row can be done on the numpy array without creating a copy, so it returns a view.
df3._is_view
Out: True
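Because the copy-or-view outcome depends on the dtypes (and may also depend on the pandas version), one way to avoid relying on it is to state the intent explicitly. The following is a minimal sketch, not part of the original answer: write through a single .loc call on df when the original should change, and call .copy() when an independent row is wanted.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 24], "name": ["Marry", "John"]},
    index=["student1", "student2"],
)

# To change the original frame, assign through one .loc call on df itself.
df.loc["student1", "age"] = 23

# To work on an independent row, ask for a copy explicitly.
row = df.loc["student1"].copy()
row["age"] = 99

print(df.loc["student1", "age"])  # 23: the write to `row` did not reach df
```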
In general, you can get a view if the data-frame has a single dtype, which is not the case with your original data-frame:
In [4]: df
Out[4]:
age name
student1 21 Marry
student2 24 John
In [5]: df.dtypes
Out[5]:
age int64
name object
dtype: object
However, when you do:
In [6]: df.loc['student3'] = ['old','Tom']
...:
The first column gets coerced to object, since columns cannot have mixed dtypes:
In [7]: df.dtypes
Out[7]:
age object
name object
dtype: object
In this case, .values returns an array backed by the same underlying buffer, and changes to that array will be reflected in the data-frame:
In [11]: vals = df.values
In [12]: vals
Out[12]:
array([[21, 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)
In [13]: vals[0,0] = 'foo'
In [14]: vals
Out[14]:
array([['foo', 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)
In [15]: df
Out[15]:
age name
student1 foo Marry
student2 24 John
student3 old Tom
On the other hand, with mixed dtypes, as in your original data-frame, .values has to build a new array, so changes to it are not reflected in the data-frame:
In [26]: df = pd.DataFrame([{'name':'Marry', 'age':21},{'name':'John','age':24}]
...: ,index=['student1','student2'])
...:
In [27]: vals = df.values
In [28]: vals
Out[28]:
array([[21, 'Marry'],
[24, 'John']], dtype=object)
In [29]: vals[0,0] = 'foo'
In [30]: vals
Out[30]:
array([['foo', 'Marry'],
[24, 'John']], dtype=object)
In [31]: df
Out[31]:
age name
student1 21 Marry
student2 24 John
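The mixed-dtype case can also be checked without comparing printed output. The sketch below assumes np.shares_memory is an acceptable way to test for a shared buffer; the names vals, age_buffer, shares and unchanged are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 24], "name": ["Marry", "John"]},
    index=["student1", "student2"],
)

vals = df.values                # integer and string columns are combined into a new object array
age_buffer = df["age"].values   # the integer column's own data

# The combined array does not share memory with the column data.
shares = np.shares_memory(vals, age_buffer)

vals[0, 0] = "foo"              # only the new array is modified
unchanged = df.loc["student1", "age"] == 21

print(shares, unchanged)        # False True
```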
Note, however, that a view is only returned when one is possible, i.e. when the selection is a proper slice; otherwise, a copy is made regardless of the dtypes:
In [39]: df.loc['student3'] = ['old','Tom']
In [40]: df2
Out[40]:
name
student3 Tom
student2 John
In [41]: df2.loc[:] = 'foo'
In [42]: df2
Out[42]:
name
student3 foo
student2 foo
In [43]: df
Out[43]:
age name
student1 21 Marry
student2 24 John
student3 old Tom
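The session above does not show how df2 was created. As a hedged illustration only, the sketch below assumes a selection by lists of labels, which is one kind of selection that is not a proper slice; the name sub and the exact selection are hypothetical, not taken from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 24, "old"], "name": ["Marry", "John", "Tom"]},
    index=["student1", "student2", "student3"],
)

# Hypothetical selection: lists of labels are not a plain slice,
# so pandas builds a new object instead of returning a view.
sub = df.loc[["student3", "student2"], ["name"]]

# Some pandas versions may emit a SettingWithCopyWarning on this line.
sub.loc[:, "name"] = "foo"

print(sub["name"].tolist())   # ['foo', 'foo']
print(df["name"].tolist())    # ['Marry', 'John', 'Tom']: df is unchanged
```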