Apply function to dataframe row use result for next row input
Question:
I am trying to create a rudimentary scheduling system. Here is what I have so far:
I have a pandas dataframe job_data that looks like this:
wc
job
start
duration
1
J1
2022-08-16 07:30:00
17
1
J2
2022-08-16 07:30:00
5
2
J3
2022-08-16 07:30:00
21
2
J4
2022-08-16 07:30:00
12
It contains a wc (work center), job, a start date and duration for the job in hours.
I have created a function add_hours that takes the following arguments: start (datetime), hours (int).
It calculates the when the job will be complete based on the start time and duration.
The code for add_hours is:
def is_in_open_hours(dt):
return (
dt.weekday() in business_hours["weekdays"]
and dt.date() not in holidays
and business_hours["from"].hour <= dt.time().hour < business_hours["to"].hour
)
def get_next_open_datetime(dt):
while True:
dt = dt + timedelta(days=1)
if dt.weekday() in business_hours["weekdays"] and dt.date() not in holidays:
dt = datetime.combine(dt.date(), business_hours["from"])
return dt
def add_hours(dt, hours):
while hours != 0:
if is_in_open_hours(dt):
dt = dt + timedelta(hours=1)
hours = hours - 1
else:
dt = get_next_open_datetime(dt)
return dt
The code to calculate the end column is:
df["end"] = df.apply(lambda x: add_hours(x.start, x.duration), axis=1)
The result of function is the end column:
wc
job
start
duration
end
1
J1
2022-08-16 07:30:00
17
2022-08-17 14:00:00
1
J2
2022-08-16 07:30:00
5
2022-08-17 10:00:00
2
J3
2022-08-16 07:30:00
21
2022-08-18 08:00:00
2
J4
2022-08-16 07:30:00
12
2022-08-18 08:00:00
Problem is, I need the start datetime in the second row to be the end datetime from the previous row instead of them all using the same start date. I also need to start this process over for each wc.
So the desired output would be:
wc
job
start
duration
end
1
J1
2022-08-16 07:30:00
17
2022-08-17 14:00:00
1
J2
2022-08-17 14:00:00
5
2022-08-17 19:00:00
2
J3
2022-08-16 07:30:00
21
2022-08-18 08:00:00
2
J4
2022-08-18 08:00:00
10
2022-08-18 18:00:00
Answers:
You can use Timedelta
and groupby
operations.
As you did not provide your custom function, I’ll apply here a simple addition of the duration:
df['start'] = pd.to_datetime(df['start'])
t = pd.to_timedelta(df['duration'], unit='h')
g = t.groupby(df['wc'])
df['start'] = df['start'].add(g.apply(lambda x: x.cumsum().shift(fill_value=pd.Timedelta('0'))))
df['end'] = df['start'].add(t)
Output:
wc job start duration end
0 1 J1 2022-08-16 07:30:00 17 2022-08-17 00:30:00
1 1 J2 2022-08-17 00:30:00 5 2022-08-17 05:30:00
2 2 J3 2022-08-16 07:30:00 21 2022-08-17 04:30:00
3 2 J4 2022-08-17 04:30:00 12 2022-08-17 16:30:00
I show an alternative method where you only need the first start date
and then bootstrap the lists according to the job durations.
# import required modules
import io
import pandas as pd
from datetime import datetime
from datetime import timedelta
# make a dataframe
# note: only the first start date is required
x = '''
wc job start duration end
1 J1 2022-08-16 07:30:00 17 2022-08-17 14:00:00
1 J2 2022-08-16 07:30:00 5 2022-08-17 10:00:00
2 J3 2022-08-16 07:30:00 21 2022-08-18 08:00:00
2 J4 2022-08-16 07:30:00 12 2022-08-18 08:00:00
'''
data = io.StringIO(x)
df = pd.read_csv(data, sep='t')
# construct start and end lists
start = datetime.strptime(df['start'][0], '%Y-%m-%d %H:%M:%S')
start_list = [start]
end_list = []
for x in df['duration']:
time_change = timedelta(hours=float(x))
new_time = start_list[-1] + time_change
start_list.append(new_time)
end_list.append(new_time)
start_list.pop(-1)
# add to dataframe
df['start'] = start_list
df['end'] = end_list
# finished
df
The result is this:
I’m not sure what’s the size of your dataset, but if it’s not too big you could use the following elegant solution (which would take quite a while to run because your’e replicating calculations)
df['cum_duration'] = df.groupby('wc').duration.transform(sum)
df['end'] = df.apply(lambda x: add_hours(x.start, x.cum_duration), axis=1)
If the OP provides the business_hours
df I could try to validate this solution
I am trying to create a rudimentary scheduling system. Here is what I have so far:
I have a pandas dataframe job_data that looks like this:
wc | job | start | duration |
---|---|---|---|
1 | J1 | 2022-08-16 07:30:00 | 17 |
1 | J2 | 2022-08-16 07:30:00 | 5 |
2 | J3 | 2022-08-16 07:30:00 | 21 |
2 | J4 | 2022-08-16 07:30:00 | 12 |
It contains a wc (work center), job, a start date and duration for the job in hours.
I have created a function add_hours that takes the following arguments: start (datetime), hours (int).
It calculates the when the job will be complete based on the start time and duration.
The code for add_hours is:
def is_in_open_hours(dt):
return (
dt.weekday() in business_hours["weekdays"]
and dt.date() not in holidays
and business_hours["from"].hour <= dt.time().hour < business_hours["to"].hour
)
def get_next_open_datetime(dt):
while True:
dt = dt + timedelta(days=1)
if dt.weekday() in business_hours["weekdays"] and dt.date() not in holidays:
dt = datetime.combine(dt.date(), business_hours["from"])
return dt
def add_hours(dt, hours):
while hours != 0:
if is_in_open_hours(dt):
dt = dt + timedelta(hours=1)
hours = hours - 1
else:
dt = get_next_open_datetime(dt)
return dt
The code to calculate the end column is:
df["end"] = df.apply(lambda x: add_hours(x.start, x.duration), axis=1)
The result of function is the end column:
wc | job | start | duration | end |
---|---|---|---|---|
1 | J1 | 2022-08-16 07:30:00 | 17 | 2022-08-17 14:00:00 |
1 | J2 | 2022-08-16 07:30:00 | 5 | 2022-08-17 10:00:00 |
2 | J3 | 2022-08-16 07:30:00 | 21 | 2022-08-18 08:00:00 |
2 | J4 | 2022-08-16 07:30:00 | 12 | 2022-08-18 08:00:00 |
Problem is, I need the start datetime in the second row to be the end datetime from the previous row instead of them all using the same start date. I also need to start this process over for each wc.
So the desired output would be:
wc | job | start | duration | end |
---|---|---|---|---|
1 | J1 | 2022-08-16 07:30:00 | 17 | 2022-08-17 14:00:00 |
1 | J2 | 2022-08-17 14:00:00 | 5 | 2022-08-17 19:00:00 |
2 | J3 | 2022-08-16 07:30:00 | 21 | 2022-08-18 08:00:00 |
2 | J4 | 2022-08-18 08:00:00 | 10 | 2022-08-18 18:00:00 |
You can use Timedelta
and groupby
operations.
As you did not provide your custom function, I’ll apply here a simple addition of the duration:
df['start'] = pd.to_datetime(df['start'])
t = pd.to_timedelta(df['duration'], unit='h')
g = t.groupby(df['wc'])
df['start'] = df['start'].add(g.apply(lambda x: x.cumsum().shift(fill_value=pd.Timedelta('0'))))
df['end'] = df['start'].add(t)
Output:
wc job start duration end
0 1 J1 2022-08-16 07:30:00 17 2022-08-17 00:30:00
1 1 J2 2022-08-17 00:30:00 5 2022-08-17 05:30:00
2 2 J3 2022-08-16 07:30:00 21 2022-08-17 04:30:00
3 2 J4 2022-08-17 04:30:00 12 2022-08-17 16:30:00
I show an alternative method where you only need the first start date
and then bootstrap the lists according to the job durations.
# import required modules
import io
import pandas as pd
from datetime import datetime
from datetime import timedelta
# make a dataframe
# note: only the first start date is required
x = '''
wc job start duration end
1 J1 2022-08-16 07:30:00 17 2022-08-17 14:00:00
1 J2 2022-08-16 07:30:00 5 2022-08-17 10:00:00
2 J3 2022-08-16 07:30:00 21 2022-08-18 08:00:00
2 J4 2022-08-16 07:30:00 12 2022-08-18 08:00:00
'''
data = io.StringIO(x)
df = pd.read_csv(data, sep='t')
# construct start and end lists
start = datetime.strptime(df['start'][0], '%Y-%m-%d %H:%M:%S')
start_list = [start]
end_list = []
for x in df['duration']:
time_change = timedelta(hours=float(x))
new_time = start_list[-1] + time_change
start_list.append(new_time)
end_list.append(new_time)
start_list.pop(-1)
# add to dataframe
df['start'] = start_list
df['end'] = end_list
# finished
df
The result is this:
I’m not sure what’s the size of your dataset, but if it’s not too big you could use the following elegant solution (which would take quite a while to run because your’e replicating calculations)
df['cum_duration'] = df.groupby('wc').duration.transform(sum)
df['end'] = df.apply(lambda x: add_hours(x.start, x.cum_duration), axis=1)
If the OP provides the business_hours
df I could try to validate this solution