What are the exact downsides of copy=False in DataFrame.merge()?

Question:

I am a bit confused about the argument copy in DataFrame.merge() after a co-worker asked me about that.

The docstring of DataFrame.merge() states:

copy : boolean, default True
    If False, do not copy data unnecessarily

The pandas documentation states:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

The docstring seems to imply that copying the data is unnecessary and can nearly always be skipped. The documentation, on the other hand, says that copying data can’t be avoided in many cases.

My questions are:

  • What are those cases?
  • What are the downsides?
Asked By: moritzbracht

Answers:

Disclaimer: I’m not very experienced with pandas and this is the first time I dug into its source, so I can’t guarantee that I’m not missing something in my below assessment.

The relevant bits of code have been recently refactored. I’ll discuss the subject in terms of the current stable version 0.20, but I don’t suspect functional changes compared to earlier versions.

The investigation starts with the source of merge in pandas/core/reshape/merge.py (formerly pandas/tools/merge.py). Ignoring some doc-aware decorators:

def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
          left_index=False, right_index=False, sort=False,
          suffixes=('_x', '_y'), copy=True, indicator=False):
    op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
                         right_on=right_on, left_index=left_index,
                         right_index=right_index, sort=sort, suffixes=suffixes,
                         copy=copy, indicator=indicator)
    return op.get_result()

Calling merge passes the copy parameter on to the constructor of the class _MergeOperation and then calls its get_result() method. The first few lines of the class, with context:

# TODO: transformations??
# TODO: only copy DataFrames when modification necessary
class _MergeOperation(object):
    [...]

Now that second comment is highly suspicious. Moving on, the copy kwarg is bound to an eponymous instance attribute, which only seems to reappear once within the class:

result_data = concatenate_block_managers(
    [(ldata, lindexers), (rdata, rindexers)],
    axes=[llabels.append(rlabels), join_index],
    concat_axis=0, copy=self.copy)

We can then track down the concatenate_block_managers function in pandas/core/internals.py, which simply passes the copy kwarg on to concatenate_join_units.

There we reach the final resting place of the original copy keyword argument:

if len(to_concat) == 1:
    # Only one block, nothing to concatenate.
    concat_values = to_concat[0]
    if copy and concat_values.base is not None:
        concat_values = concat_values.copy()
else:
    concat_values = _concat._concat_compat(to_concat, axis=concat_axis)

As you can see, the only thing that copy does is rebind a copy of concat_values here to the same name in the special case of concatenation when there’s really nothing to concatenate.
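To make the condition in that branch concrete, here is a minimal NumPy sketch. The helper maybe_copy is a hypothetical stand-in written for illustration, not pandas code; it only mirrors the `copy and concat_values.base is not None` test quoted above. It relies on the fact that a NumPy array's `.base` is None when the array owns its buffer and points at the parent array when it is a view.

```python
import numpy as np


def maybe_copy(values, copy):
    """Hypothetical stand-in for the single-block branch quoted above."""
    # Copy only when asked to AND the array is a view on someone else's buffer.
    if copy and values.base is not None:
        return values.copy()
    return values


owner = np.arange(6)   # owns its buffer, so owner.base is None
view = owner[:3]       # a view, so view.base is owner

same_owner = maybe_copy(owner, copy=True)    # returned as-is: nothing to protect
copied_view = maybe_copy(view, copy=True)    # a fresh array, detached from owner
same_view = maybe_copy(view, copy=False)     # returned as-is: still a view

copied_view[0] = 100   # does not touch owner
same_view[1] = 200     # writes through to owner
```

Note that even with copy=True, an array that already owns its data is returned unchanged in this sketch; the flag only matters for views.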

Now, at this point my lack of pandas knowledge starts to show, because I’m not really sure what exactly is going on this deep inside the call stack. But the above hot-potato scheme with the copy keyword argument ending in that no-op-like branch of a concatenation function is perfectly consistent with the “TODO” comment above, the documentation quoted in the question:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

(emphasis mine), and the related discussion on an old issue:

IIRC I think the copy parameter only matters here is its a trivial merge and you actually do want it copied (kind I like a reindex with the same index)

Based on these hints I suspect that in the vast majority of real use cases copying is inevitable, and the copy keyword argument has no effect. However, since skipping a copy step might improve performance in the small number of exceptions (at no cost whatsoever to the majority of use cases), the choice was implemented.

I suspect that the rationale is something like this: the upside of not copying unless necessary (which is only possible in a few very special cases) is that the code avoids some memory allocations and copies. The downside is that not returning a copy in those few cases might cause surprises if one doesn’t expect that mutating the return value of merge could affect the original dataframe. So the default value of the copy keyword argument is True, and the user only gets a non-copy from merge if they explicitly volunteer for it (and even then they’ll likely still end up with a copy).
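A small sketch of the behaviour the default is meant to guarantee, using made-up frames and column names (key, val_left, val_right are assumptions for illustration). With copy left at its default, modifying the merged result should leave the inputs untouched:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
left = pd.DataFrame({"key": [1, 2, 3], "val_left": [10, 20, 30]})
right = pd.DataFrame({"key": [1, 2, 3], "val_right": [0.1, 0.2, 0.3]})

# copy is left at its default, so the result owns its own data.
merged = left.merge(right, on="key")

# Mutate the result in place.
merged.loc[0, "val_left"] = -1

# The original frame still holds its original value.
left_unchanged = left.loc[0, "val_left"] == 10
```

Whether the same holds with copy=False depends on whether pandas could actually skip the copy for that particular merge, which, per the discussion above, is rare.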

You will see the copy parameter as a convention throughout pandas. It specifies whether or not to make a deep copy of the data in memory. If you don’t make a copy, the result may share memory with the original, so modifying one risks mutating the other.

Answered By: Alex W