Python multithreading/multiprocessing for a loop with 3+ arguments
Question:
Hello, I have a CSV with about 2.5k lines of Outlook emails and passwords.
The CSV looks like
header:
username, password
content:
[email protected],123password1
[email protected],123password2
[email protected],123password3
[email protected],123password4
[email protected],123password5
The code lets me log into each account and delete every mail in it, but it takes too long to run through 2.5k accounts, so I wanted to speed it up with multithreading.
This is my code:
from csv import DictReader
import imap_tools
from datetime import datetime

def IMAPDumper(accountList, IMAP_SERVER, search_criteria, row):
    accountcounter = 0
    with open(accountList, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
            TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
            # adds a counter for the amount of accounts processed by the script
            accountcounter = accountcounter + 1
            print("_____________________________________________")
            print(TIMESTAMP, "Account", accountcounter)
            print("_____________________________________________")
            # resetting emailcounter each time
            emailcounter = 0
Answers:
This is not necessarily the best way to do it, but it is the shortest in writing time. I don't know if you are familiar with Python generators, but we will have to use one; the generator will act as a work dispatcher.
from csv import DictReader

def generator():
    with open("t.csv", 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            yield row

gen = generator()
Next, you will have your main function where you do your IMAP stuff
def main():
    while True:
        # The try prevents the thread from crashing once the whole file has been processed
        try:
            # returns the next row of the csv
            working_set = next(gen)
            # do_some_stuff
            # ...
            # do_other_stuff
        except StopIteration:
            break
Then you just have to split the work across multiple threads!
import threading

# You can change the number of threads
number_of_threads = 5
thread_list = []
# Creates 5 thread objects
for _ in range(number_of_threads):
    thread_list.append(threading.Thread(target=main))
# Starts all thread objects
for thread in thread_list:
    thread.start()
# Waits for all threads to finish
for thread in thread_list:
    thread.join()
I hope this helped you!
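One caveat with this approach: a plain generator is not safe to advance from several threads at once (CPython can raise ValueError: generator already executing when two threads call next(gen) at the same time). A queue.Queue sidesteps that. Below is a minimal sketch of the same dispatch pattern, where row.upper() is just a placeholder for the real per-row IMAP work:

```python
import queue
import threading

def process_rows_threaded(rows, num_threads=5):
    """Dispatch rows to worker threads via a thread-safe queue."""
    work = queue.Queue()
    for row in rows:
        work.put(row)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # all rows were enqueued before the threads started,
                # so an empty queue means there is no more work
                row = work.get_nowait()
            except queue.Empty:
                return
            processed = row.upper()  # placeholder for the IMAP processing
            with results_lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each worker pulls rows from the queue until it is empty, so no two threads ever touch the same row and no generator is shared.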
This is a job that is best accomplished using a thread pool, whose optimum size will need to be experimented with. I have set the size below to 100, which may be overly ambitious (or not). You can try decreasing or increasing NUM_THREADS to see what effect it has.
The important thing is to modify the function IMAPDumper so that it is passed a single row from the CSV file to process, and therefore does not need to open and read the file itself.
There are various methods you can use with the class ThreadPool in module multiprocessing.pool (this class is not well documented; it is the multithreading analog of the multiprocessing pool class Pool in the same module and has the exact same interface). The advantage of imap_unordered is that (1) the passed iterable argument can be a generator that will not be converted to a list, which saves memory and time if that list would be very large, and (2) the ordering of the results (return values from the worker function, IMAPDumper in this case) is arbitrary, so it might run slightly faster than imap or map. Since your worker function does not explicitly return a value (it defaults to None), this should not matter.
from csv import DictReader
from datetime import datetime
from functools import partial
from itertools import count
from multiprocessing.pool import ThreadPool
import imap_tools

# shared account counter; next() on itertools.count is effectively
# atomic in CPython, so each row gets a unique number across threads
account_counter = count(1)

def IMAPDumper(IMAP_SERVER, search_criteria, row):
    """ process a single row """
    # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
    TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
    # counts the accounts processed by the script
    accountnumber = next(account_counter)
    print("_____________________________________________")
    print(TIMESTAMP, "Account", accountnumber)
    ...  # etc.

def generate_rows():
    """ generator function to yield rows """
    with open('outlookAccounts.csv', newline='') as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            yield row

NUM_THREADS = 100
worker = partial(IMAPDumper, "outlook.office365.com", "ALL")
pool = ThreadPool(NUM_THREADS)
for return_value in pool.imap_unordered(worker, generate_rows()):
    # must iterate the iterator returned by imap_unordered to ensure
    # all tasks are run and completed
    pass  # return values are None
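To see the partial + imap_unordered mechanics in isolation, here is a small self-contained sketch with a dummy worker; worker_fn and the fake rows are placeholders for illustration only, not part of the original script:

```python
from functools import partial
from multiprocessing.pool import ThreadPool

def worker_fn(server, criteria, row):
    # stand-in for the real per-row IMAP work
    return "%s %s %s" % (server, criteria, row["username"])

rows = [{"username": "user%d@outlook.com" % i} for i in range(3)]

# partial fixes the first two arguments, leaving only `row` to be
# supplied by the pool for each item of the iterable
worker = partial(worker_fn, "outlook.office365.com", "ALL")

with ThreadPool(2) as pool:
    # imap_unordered consumes the iterable lazily; list() drains it
    results = list(pool.imap_unordered(worker, iter(rows)))
```

Because the results are unordered, anything that depends on row order (such as a running account number) must be handled inside the worker, as with the counter above.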