How can I iterate through list data faster using multiprocessing?

Question:

I’m trying to determine the amount of time worked by a list of employees during their shift – this data is given to me in the form of a CSV file.

I populate a matrix with this data and iterate through it using a while loop, applying the necessary conditionals (for example, deducting 30 minutes for lunch). The results are put into a new list, which is used to make an Excel worksheet.

My script does what it is meant to do, but takes a very long time when having to loop through a lot of data (it needs to loop through approximately 26 000 rows).
My idea is to use multiprocessing to do the following three loops in parallel:

  1. Convert the time from hh:mm:ss to minutes.
  2. Loop through and apply conditionals.
  3. Round values and convert back to hours, so that this is not done within the big while loop.

Is this a good idea?
If so, how would I have the loops run in parallel when I need data from one loop to be used in the next? My first thought is to use the time function to give a delay, but then I’m concerned that my loops may "catch up" with one another and spit out that the list index being called does not exist.

Any more experienced opinions would be amazing, thanks!

My script:

import pandas as pd

# Function: Round down the time to the next lowest ten minutes (with decimals=-1) --> 77 = 70 ; 32 = 30:
def floor_time(n, decimals=0):
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier

# Function: Get data from the CSV file:
def get_data():
    df = pd.read_csv('/Users/Chadd/Desktop/dd.csv', sep = ',')
    list_of_rows = [list(row) for row in df.values]
    data = []
    i = 0
    while i < len(list_of_rows):
        data.append(list_of_rows[i][0].split(';'))
        data[i].pop()
        i += 1
    return data

# Function: Convert an hh:mm:ss time string to minutes since midnight:
def get_time(time_data):
    return int(time_data.split(':')[0])*60 + int(time_data.split(':')[1])

# Function: Loop through data in CSV applying conditionals:
def get_time_worked():
    i = 0 # Looping through entry data
    j = 1 # Looping through departure data
    list_of_times = []

    while j < len(get_data()):

        start_time = get_time(get_data()[i][3])
        end_time = get_time(get_data()[j][3])

        # Morning shift - start time < end time
        if start_time < end_time:
            time_worked = end_time - start_time # end time - start time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 6*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 06:00:00:
            if start_time < 6*60: # Early
                time_worked = end_time - 6*60

        # Afternoon shift - start time > end time
        elif start_time > end_time:
            time_worked = 24*60 - start_time + end_time # 24*60 - start time + end time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 18*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 18:00:00:
            if start_time < 18*60: # Early
                time_worked = 24*60 - 18*60 + end_time

        # If time worked exceeds 5 hours, deduct 30 minutes for lunch:
        if time_worked >= 5*60:
            time_worked = time_worked - 30

        # Set max time worked to 11.5 hours:
        if time_worked > 11.5*60:
            time_worked = 11.5*60

        list_of_times.append([get_data()[i][1], get_data()[i][2], round(floor_time(time_worked, decimals = -1)/60, 2)])

        i += 2
        j += 2

    return list_of_times

# Function: Save the data into an Excel worksheet:
def save_data():
    file_heading = '{} to {}'.format(get_data()[0][2], get_data()[len(get_data())-1][2])
    file_heading_2 = file_heading.replace('/', '_')

    df = pd.DataFrame(get_time_worked())
    writer = pd.ExcelWriter('/Users/Chadd/Desktop/{}.xlsx'.format(file_heading_2), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Hours Worked', index=False)
    writer.close()  # writer.save() was removed in pandas 2.0

save_data()
Asked By: ChaddRobertson

Answers:

Take a look at multiprocessing.Pool, which lets you run the same function in parallel over many different inputs. From the docs:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))

Then, it’s a matter of splitting up your data into chunks (instead of the [1, 2, 3] in the example).
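
As a rough sketch of what that could look like for the data in the question: each entry/departure pair is independent of every other pair, so the three steps (convert to minutes, apply conditionals, round and convert back to hours) can run one after another inside a single worker function, and only the pairs are spread across processes. The names to_minutes, hours_worked and process_rows, the worker count and the chunk size are all made up for illustration, and the rules are simplified (only the lunch deduction is shown; the late/early adjustments and the 11.5 hour cap are left out). The rows are assumed to be loaded once, up front. The script in the question calls get_data(), which re-reads the CSV, several times per loop iteration, so loading the data a single time may well help more than multiprocessing does.

```python
from multiprocessing import Pool


def to_minutes(hhmmss):
    """Convert an 'hh:mm:ss' string to minutes since midnight."""
    hours, minutes = hhmmss.split(':')[:2]
    return int(hours) * 60 + int(minutes)


def hours_worked(pair):
    """Hypothetical worker: one (entry_row, departure_row) pair in, one result row out.

    Assumes the layout used in the question: index 1 and 2 identify the
    employee and date, index 3 holds the clock time.
    """
    entry, departure = pair
    start = to_minutes(entry[3])
    end = to_minutes(departure[3])

    # The modulo handles shifts that run past midnight.
    worked = (end - start) % (24 * 60)

    # Simplified rule set: only the lunch deduction is applied here.
    if worked >= 5 * 60:
        worked -= 30

    # Round down to the nearest ten minutes, then convert to hours.
    worked = worked // 10 * 10
    return [entry[1], entry[2], round(worked / 60, 2)]


def process_rows(rows, workers=4, chunksize=1000):
    """Pair up alternating entry/departure rows and map them over a pool."""
    pairs = list(zip(rows[0::2], rows[1::2]))
    with Pool(workers) as pool:
        # chunksize controls how many pairs are sent to a worker at a time.
        return pool.map(hours_worked, pairs, chunksize=chunksize)


if __name__ == '__main__':
    # Made-up sample rows; in the real script these would come from
    # a single call to get_data().
    rows = [
        ['1', 'Alice', '01/02/2019', '06:00:00'],
        ['1', 'Alice', '01/02/2019', '14:37:00'],
        ['2', 'Bob', '01/02/2019', '18:00:00'],
        ['2', 'Bob', '02/02/2019', '02:15:00'],
    ]
    print(process_rows(rows, workers=2, chunksize=1))
```

With only about 26 000 rows, the cost of starting processes and sending data to them can outweigh the gain, so it is worth timing the single-process version of hours_worked first.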
My personal preference, though, is to take the time to learn a tool that is distributed by default, such as Spark with pyspark. It will help you immensely in the long run.

Answered By: edd