Fastest way to apply an async function to a pandas DataFrame

Question:

pandas has an apply method that lets you apply a synchronous function to a column, like this:

import numpy as np
import pandas as pd

def fun(x):
    return x * 2

df = pd.DataFrame(np.arange(10), columns=['old'])

df['new'] = df['old'].apply(fun)

What is the fastest way to do something similar when the function to apply is an async function, fun2? My current approach awaits one row at a time:

import asyncio
import numpy as np
import pandas as pd

async def fun2(x):
    return x * 2

async def main():
    df = pd.DataFrame(np.arange(10), columns=['old'])
    df['new'] = 0
    for i in range(len(df)):
        # Await one row at a time; .loc avoids chained assignment.
        df.loc[i, 'new'] = await fun2(df.loc[i, 'old'])
    print(df)

asyncio.run(main())

Answers:

Use asyncio.gather and overwrite the whole column when complete.

import asyncio

import numpy as np
import pandas as pd


async def fun2(x):
    return x * 2


async def main():
    df = pd.DataFrame(np.arange(10), columns=['old'])
    df['new'] = await asyncio.gather(*(fun2(v) for v in df['old']))
    print(df)


asyncio.run(main())

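This passes each value in the column to the async function and schedules all of the resulting coroutines concurrently. When the function actually waits on something (network or disk I/O, for example), those waits overlap, which is much faster than awaiting each result sequentially in a loop. For a coroutine that never awaits, such as the toy fun2 here, there is nothing to overlap, so expect little or no gain.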
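To make the difference visible, here is a minimal timing sketch. It is not part of the original code: slow_double, apply_sequential, apply_concurrent and compare are hypothetical names, and the 0.05 second asyncio.sleep is an assumed stand-in for a real I/O-bound call.

```python
import asyncio
import time

import numpy as np
import pandas as pd


async def slow_double(x, delay=0.05):
    # Hypothetical stand-in for an I/O-bound call (e.g. a network request).
    await asyncio.sleep(delay)
    return x * 2


async def apply_sequential(series):
    # Awaits one coroutine at a time, so the delays add up.
    return [await slow_double(v) for v in series]


async def apply_concurrent(series):
    # Schedules every coroutine at once, so the delays overlap.
    return await asyncio.gather(*(slow_double(v) for v in series))


async def compare():
    df = pd.DataFrame(np.arange(10), columns=['old'])

    start = time.perf_counter()
    sequential = await apply_sequential(df['old'])
    sequential_time = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await apply_concurrent(df['old'])
    concurrent_time = time.perf_counter() - start

    return sequential, concurrent, sequential_time, concurrent_time


sequential, concurrent, sequential_time, concurrent_time = asyncio.run(compare())
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With ten rows, the sequential version should take roughly ten times the delay, while the gather version should take roughly one delay plus scheduling overhead. Exact timings will vary by machine.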

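Note: asyncio.gather returns its results in the same order as the awaitables passed in, so row order is preserved even if the coroutines finish out of order. The column is only assigned once every awaitable has completed; if any of them raises, the exception propagates (by default) and the column is not assigned.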
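The ordering and failure behaviour described in the note can be checked with a short sketch. The names delayed_double, failing_double and demo are hypothetical, and the delays are chosen only so that later rows finish first.

```python
import asyncio

import pandas as pd


async def delayed_double(x):
    # Hypothetical coroutine: larger inputs sleep less, so completion
    # order is the reverse of input order.
    await asyncio.sleep((3 - x) * 0.01)
    return x * 2


async def failing_double(x):
    # Hypothetical coroutine that fails for one input.
    if x == 2:
        raise ValueError("cannot process 2")
    return x * 2


async def demo():
    df = pd.DataFrame({'old': [0, 1, 2, 3]})

    # Results line up with the input rows regardless of completion order.
    df['new'] = await asyncio.gather(*(delayed_double(v) for v in df['old']))

    # With the default return_exceptions=False, the first exception
    # propagates out of the await, so the assignment never happens.
    try:
        df['failed'] = await asyncio.gather(
            *(failing_double(v) for v in df['old'])
        )
    except ValueError as exc:
        error = str(exc)
    else:
        error = None

    return df, error


result_df, error = asyncio.run(demo())
print(result_df)
print(error)
```

If per-row failures should not abort the whole column, passing return_exceptions=True to asyncio.gather returns the exception objects in place of results, which can then be handled row by row.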

Resulting output DataFrame for the asyncio.gather code in this answer:

   old  new
0    0    0
1    1    2
2    2    4
3    3    6
4    4    8
5    5   10
6    6   12
7    7   14
8    8   16
9    9   18
Answered By: Henry Ecker