I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a
query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe,
df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc = 70
df.ix[1,'B':'E'] = 222
will change df. But I’m lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
df[df.C <= df.B].ix[:,'B':'E']
Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here are the rules; later ones override earlier ones:
All operations generate a copy.
If inplace=True is provided, the operation will modify in place; only some operations support this.
An indexer that sets, e.g.
.loc/.iloc/.iat/.at, will set in place.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for
.query; this will always return a copy, as it’s evaluated by
An indexer that gets on a multiple-dtyped object is always a copy.
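A minimal sketch of two of the rules above, using the question's own statements on a small deterministic frame (the `np.arange` data and the helper names `original_E`, `query_left_df_unchanged` and `row_1` are illustrative assumptions, not part of the original post). It only shows the observable effect on `df`; it does not inspect how pandas stores the data internally.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used in the question.
df = pd.DataFrame(np.arange(64, dtype=float).reshape(8, 8),
                  columns=list("ABCDEFGH"), index=range(1, 9))
original_E = df["E"].copy()

# .query() returns a new object, so writing to the result leaves df alone.
# (Some pandas versions may emit a SettingWithCopyWarning on this line.)
foo = df.query("2 < index <= 5")
foo.loc[:, "E"] = 40
query_left_df_unchanged = df["E"].equals(original_E)

# An indexer used directly on df to set values writes into df itself.
df.loc[1, "B":"E"] = 222
row_1 = df.loc[1, "B":"E"].tolist()

print(query_left_df_unchanged)
print(row_1)
```

Checking `df` after each assignment, as done here, is a more dependable test than reasoning about whether a given selection happened to be a view.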
Your example of
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this). Instead, do
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work.
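As a sketch of how that single `.loc` call can be used to change the values that satisfy a condition, here is a tiny hand-made frame (the data and the value 654 are made up for illustration). The row mask and the column slice go into one indexer, so the assignment is applied to `df` directly.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [5.0, 1.0, 7.0],
    "C": [4.0, 2.0, 9.0],
    "D": [0.0, 0.0, 0.0],
    "E": [0.0, 0.0, 0.0],
    "F": [8.0, 8.0, 8.0],
})

# Rows where C <= B: only the first row (4 <= 5) qualifies.
mask = df["C"] <= df["B"]

# One indexing operation: rows selected by the mask, columns B through E
# (label slices include both endpoints).
df.loc[mask, "B":"E"] = 654

print(df)
```

Columns outside the slice (`A` and `F`) and rows where the mask is false are left as they were.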
The chained indexing is two separate Python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a
SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
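To make the "two separate operations" point concrete, the sketch below runs the chained form and then the same steps written out one at a time. The frame and the names `before`, `tmp` and the `*_changed_df` flags are hypothetical. Boolean selection returns new data, so in both forms the assignment lands on a temporary object rather than on `df`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [5.0, 1.0, 7.0],
    "C": [4.0, 2.0, 9.0],
    "D": [0.0, 0.0, 0.0],
    "E": [0.0, 0.0, 0.0],
})
before = df.copy()
mask = df["C"] <= df["B"]

# Chained form: df[mask] builds a temporary object, then .loc assigns into
# that temporary, which is discarded right away. pandas may warn here.
df[mask].loc[:, "B":"E"] = 654
chained_changed_df = not df.equals(before)

# The same two steps written out separately.
tmp = df[mask]               # step 1: selection returns new data
tmp.loc[:, "B":"E"] = 654    # step 2: the assignment lands on tmp, not df
two_step_changed_df = not df.equals(before)
tmp_values = tmp.loc[0, "B":"E"].tolist()

print(chained_changed_df, two_step_changed_df)
print(tmp_values)
```

With a selection that happens to return a view instead (for example some slices on a single-dtype frame), the chained form might modify `df` in some pandas versions, which is why it should not be relied on either way.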
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three all seem to be references to df, but the last one is not!
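Rather than guessing, the relationship can be checked directly. The sketch below uses `is` for object identity and a small hypothetical helper, `shares_memory`, built on `np.shares_memory`. It assumes a single-dtype frame, since `to_numpy()` on a mixed-dtype frame builds a new array. No expected results are shown for the sliced objects because they can differ between pandas versions.

```python
import numpy as np
import pandas as pd


def shares_memory(a, b):
    """Return True if two single-dtype frames are backed by overlapping memory."""
    return np.shares_memory(a.to_numpy(), b.to_numpy())


df = pd.DataFrame(np.random.randn(8, 8), columns=list("ABCDEFGH"), index=range(1, 9))

u = df
v = df.loc[:, :]
w = df.iloc[:, :]
z = df.iloc[0:, ]

# Plain assignment only binds another name to the same object.
same_object = u is df

# An explicit deep copy never shares memory with the original.
copy_shares = shares_memory(df, df.copy())

# Inspect the sliced objects: identity and memory sharing are separate questions.
for name, obj in [("v", v), ("w", w), ("z", z)]:
    print(name, "is df:", obj is df, "| shares memory:", shares_memory(df, obj))
```

An object can be a distinct Python object (`obj is df` is false) and still share the underlying buffer, so the two checks answer different questions.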