Why use pandas.assign rather than simply initializing a new column?

Question:

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr’s mutate in R. However, I’ve always managed by simply initializing a new column ‘on the fly’. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:

import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])

but the pandas.DataFrame.assign documentation recommends doing this:

df.assign(ln_A = lambda x: np.log(x.A))
# or 
newcol = np.log(df['A'])
df.assign(ln_A=newcol)

Both methods produce the same dataframe. In fact, the first method (my ‘on the fly’ assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).

So is there a reason I should stop using my old method in favour of df.assign?

Asked By: sacuL


Answers:

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns a new object that has a copy of the original data with the requested changes; the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:

>>> new_df = df.assign(A=1)

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] assignment does not.
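To make the distinction concrete, here is a minimal sketch that checks both behaviours on the frame from the question. The helper variable names are only illustrative and are not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# .assign returns a new frame; df itself is left untouched.
new_df = df.assign(A=1)
original_unchanged = df["A"].tolist() == list(range(1, 11))
new_frame_all_ones = bool((new_df["A"] == 1).all())

# Item assignment overwrites column A in df itself.
df["A"] = 1
original_overwritten = bool((df["A"] == 1).all())

print(original_unchanged, new_frame_all_ones, original_overwritten)
```

After the last statement the original values of A are gone from df, which is why .assign is the safer choice when the old frame is still needed.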

Answered By: donkopotamus

The premise of assign is that it returns:

A new DataFrame with the new columns in addition to all the existing columns.

Also, assign cannot be used to change the original dataframe in place.

The callable must not change input DataFrame (though pandas doesn’t check it).

On the other hand, df['ln_A'] = np.log(df['A']) modifies the dataframe in place.
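As a rough illustration of the two points above, the sketch below adds ln_A both ways and compares the column lists. It assumes a callable that only reads from its argument, as the documentation asks; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# The callable receives the frame that .assign was called on and only reads from it.
with_log = df.assign(ln_A=lambda x: np.log(x["A"]))
cols_after_assign = list(df.columns)   # df has not gained a column
cols_in_result = list(with_log.columns)

# Item assignment adds the column to df itself.
df["ln_A"] = np.log(df["A"])
cols_after_setitem = list(df.columns)

print(cols_after_assign, cols_in_result, cols_after_setitem)
```

Writing df = df.assign(...) rebinds the name to the new frame, which gives the same end result as item assignment at the cost of building a new object.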


So is there a reason I should stop using my old method in favour of df.assign?

I think you can try df.assign, but if you are doing memory-intensive work, it is better to stick with what you did before, or to use operations with inplace=True.

Answered By: prosti

How about df.apply()? Does it make changes to the original df?

Answered By: Melika