Trouble passing in lambda to apply for pandas DataFrame

Question:

I’m trying to apply a function to all rows of a pandas DataFrame (actually just one column in that DataFrame)

I’m sure this is a syntax error, but I’m not sure what I’m doing wrong:

df['col'].apply(lambda x, y:(x - y).total_seconds(), args=[d1], axis=1)

The col column contains a bunch of datetime.datetime objects and d1 is the earliest of them. I’m trying to get a column of the total number of seconds for each of the rows.

I keep getting the following error

TypeError: <lambda>() got an unexpected keyword argument 'axis'

I don’t understand why axis is getting passed to my lambda function

I’ve also tried doing

def diff_dates(d1, d2):
    return (d1-d2).total_seconds()

df['col'].apply(diff_dates, args=[d1], axis=1)

And I get the same error.

Asked By: sedavidw


Answers:

Note there is no axis param for a Series.apply call, unlike a DataFrame.apply call:

Series.apply(func, convert_dtype=True, args=(), **kwds)

func : function
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If False, leave as dtype=object
args : tuple
Positional arguments to pass to function in addition to the value

There is one for a DataFrame, but it’s unclear how you expect this to work when you’re calling apply on a Series yet expecting it to operate on a row.
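As a sketch of why the error names the lambda itself: Series.apply collects unrecognized keyword arguments in **kwargs and forwards them to the function, so axis=1 ends up being passed to the lambda (behavior as of recent pandas versions; toy data assumed):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Series.apply has no axis parameter; unknown keyword arguments are
# forwarded to the function itself, so axis=1 reaches the lambda and
# triggers the TypeError from the question.
try:
    s.apply(lambda x: x * 2, axis=1)
    msg = None
except TypeError as e:
    msg = str(e)

print(msg)
```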

Answered By: EdChum

A single column is (usually) a pandas Series and, as EdChum mentioned, DataFrame.apply takes an axis argument but Series.apply doesn’t, so calling apply with axis=1 won’t work on a column.

The following works:

df['col'].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))

To apply a function to each element of the column, map can also be used:

df['col'].map(lambda x: (x - d1).total_seconds())

As apply is just syntactic sugar for a Python loop, a list comprehension may be more efficient than both of them because it doesn’t have the pandas overhead:

[(x - d1).total_seconds() for x in df['col'].tolist()]

For a single-column DataFrame, axis=1 may be passed:

df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
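To see why the .dt accessor is needed here: with axis=1, the function receives each row as a Series, so the subtraction yields a Series of Timedeltas rather than a single Timedelta. A minimal sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'col': pd.to_datetime(['2023-01-01', '2023-01-02'])})
d1 = df['col'].min()

# Each x here is a one-element row Series (index ['col']), so x - y is a
# Series of Timedeltas and .dt.total_seconds() converts it elementwise.
# The result is a DataFrame with a single 'col' column of seconds.
out = df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
print(out)
```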

PSA: Avoid apply if you can

apply is not even needed most of the time. For the case in the OP (and most other cases), a vectorized operation exists (just subtract d1 from the column; the scalar is broadcast across the column) and is much faster than apply anyway:

(df['col'] - d1).dt.total_seconds()
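A quick self-contained check (toy dates, assumed for illustration) that the vectorized subtraction produces the same values as the apply-based version from above:

```python
import pandas as pd

# Three consecutive days; d1 is the earliest, as in the question.
df = pd.DataFrame({'col': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])})
d1 = df['col'].min()

# Elementwise apply: each x is a Timestamp, x - d1 a Timedelta.
via_apply = df['col'].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))

# Vectorized: d1 is broadcast against the whole column at once.
via_vector = (df['col'] - d1).dt.total_seconds()
```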

Timings

The vectorized subtraction is about 150 times faster than apply on a column and over 7000 times faster than apply on a single-column DataFrame for a frame with 10k rows. As apply is a loop, this gap grows as the number of rows increases.

df = pd.DataFrame({'col': pd.date_range('2000', '2023', 10_000)})
d1 = df['col'].min()

%timeit df['col'].apply(lambda x, y: (x - y).total_seconds(), args=[d1])
# 124 ms ± 7.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['col'].map(lambda x: (x - d1).total_seconds())
# 127 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [(x - d1).total_seconds() for x in df['col'].tolist()]
# 107 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit (df['col'] - d1).dt.total_seconds()
# 851 µs ± 189 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
# 6.07 s ± 419 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered By: cottontail