python multiprocessing dataframe rows
Question:
def main():
    df_master = read_bb_csv(file)
    p = Pool(2)
    if len(df_master.index) >= 1:
        for row in df_master.itertuples(index=True, name='Pandas'):
            p.map((partial(check_option, arg1=row), df_master))

def check_option(row):
    get_price(row)
I am using Pandas to read a CSV and loop through the rows to process the information. Since get_price() needs to make several HTTP calls per row, I want to use multiprocessing to process the rows in parallel (depending on the number of CPU cores) to speed up the function.
The issue I am having is that I am new to multiprocessing and don't know how to use
p.map((check_option, arg1=row), df_master) to process all rows in the dataframe.
There is no need to return a value from the function; I just need the rows handed out to the worker processes.
Thank you for your help.
Answers:
You can use the following Python 3 version, which I use everywhere and which works like a charm! There is also a Python 3 package, mpire,
which I found really useful; its usage is similar to that of the standard multiprocessing package.
from multiprocessing import Pool

import pandas as pd

def get_price(idx, row):
    # logic to fetch price
    return idx, price

def main():
    df = pd.read_csv("path to file")
    NUM_OF_WORKERS = 2
    with Pool(NUM_OF_WORKERS) as pool:
        # Dispatch one task per row, then collect the results as they finish.
        results = [pool.apply_async(get_price, [idx, row]) for idx, row in df.iterrows()]
        for result in results:
            idx, price = result.get()
            df.loc[idx, 'Price'] = price
    # do whatever you want to do with df, e.g. save it back to the same file

if __name__ == "__main__":
    # Don't forget to call main() under this guard: it is required on Windows
    # when using multiple processes/threads, and is good practice everywhere.
    # More info: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming
    main()
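Here is a minimal runnable sketch of the same apply_async pattern, with a stand-in get_price() that just doubles a column value instead of making HTTP calls (the column name "base" and the fill_prices() helper are assumptions for illustration):

```python
from multiprocessing import Pool

import pandas as pd


def get_price(idx, row):
    # Stand-in for the real HTTP lookup: derive a fake price from the row.
    return idx, row["base"] * 2


def fill_prices(df, workers=2):
    # Dispatch one task per row; each worker returns (index, price),
    # so the results can be written back to the right rows.
    with Pool(workers) as pool:
        results = [pool.apply_async(get_price, (idx, row))
                   for idx, row in df.iterrows()]
        for result in results:
            idx, price = result.get()
            df.loc[idx, "Price"] = price
    return df


if __name__ == "__main__":
    df = pd.DataFrame({"base": [10, 20, 30]})
    print(fill_prices(df))
```

Returning the index alongside the price is what makes the write-back safe: apply_async results can be collected in any order, and the index ties each price back to its row.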