summing a time-series column upon it self with a condition
Question:
I’m currently working with some data that I receive from an engineering plant, the data comes out(roughly) as the following :
df = pd.DataFrame({'ID' : np.random.randint(1,25,size=5),
'on/off' : np.random.randint(0,2,size=5),
'Time' : pd.date_range(start='01/01/2019',periods=5,freq='5s')})
print(df)
ID on/off Time
0 17 0 2019-01-01 00:00:00
1 21 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 1 2019-01-01 00:00:15
4 12 0 2019-01-01 00:00:20
the 0 and 1 in the on/off column correspond to when a machine is on or off (0 = on 1 = off)
currently, I use the following line of beautiful code to get the difference between my column as the data is rolling
df['Time Difference'] = (df.time - df.time.shift())
print(df)
ID on/off Time Time Difference
0 17 0 2019-01-01 00:00:00 NaT
1 21 0 2019-01-01 00:00:05 00:00:05
2 12 1 2019-01-01 00:00:10 00:00:05
3 12 1 2019-01-01 00:00:15 00:00:05
4 12 0 2019-01-01 00:00:20 00:00:05
now as this dataframe is quite verbose (each week I’ll receive about 150k rows)
what would be the best way to sum amount time a machine is off (where df[‘on/off] == 1) until the next 0 comes along? so in the above example for the 1st of January 2019 the machine of ID 12 didn’t run for 15 seconds until it resumed at 00:00:20
Answers:
Here’s an approach that works for a simple example of one machine that varies between on and off during the course of one day. It works regardless of whether the machine is in on
or off
state in the first row.
df = pd.DataFrame({'ID': [12, 12, 12, 12, 12],
'on/off': [0,0,1,0,1],
'Time': ['2019-01-01 00:00:00', '2019-01-01 00:00:05', '2019-01-01 00:00:10','2019-01-01 00:00:15','2019-01-01 00:00:20']
})
ID on/off Time
0 12 0 2019-01-01 00:00:00
1 12 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 0 2019-01-01 00:00:15
4 12 1 2019-01-01 00:00:20
- First I made sure the
Time
column dtype is datetime64:
df['Time'] = pd.to_datetime(df['Time'])
- Then I get the indices of all rows where the state changed (either from
off
to on
, or from on
to off
:
s = df[df['on/off'].shift(1) != df['on/off']].index
df = df.loc[s]
- Then I create a column called
time shift
, which shows the timestamp of the most recent row where power state changed:
df['time shift'] = df['Time'].shift(1)
At this point the dataframe looks like this:
ID on/off Time time shift
0 12 0 2019-01-01 00:00:00 NaT
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
3 12 0 2019-01-01 00:00:15 2019-01-01 00:00:10
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
- Now, since we want to count the duration that the machine was off, I look at only the row indices where the state became
on
:
r = df[df['on/off'] == 1].index
df = df.loc[r]
At this point, the dataframe looks as it does below. Notice that the time shift
column is displaying the point at which the machine most recently turned off, prior to the time being displayed in Time
column, which is the timestamp when the machine turned back on. Finding the difference between these two columns will give us the length of each duration that the machine was off during the day:
ID on/off Time time shift
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
- The following line calculates total off-time, by summing the durations of each period that the machine was in its
off
state:
(df['Time'] - df['time shift']).sum()
Which outputs:
Timedelta('0 days 00:00:15')
Some additional context on how the Pandas .shift()
method works:
Shift takes all the rows in a column, and moves them either forward or back by a certain amount. .shift(1)
tells pandas to move the index of each row forward, or up, by 1. .shift(-1)
tells pandas to move the index of each row back, or down, by 1. Alternately put, .shift(1)
lets you look at the value of a column at the previous row index, and .shift(-1)
lets you look at the value of a column at the next row index, relative a given row in a column. It’s a handy way to compare a column’s values across different rows, without resorting to for-loops.
I’m currently working with some data that I receive from an engineering plant, the data comes out(roughly) as the following :
df = pd.DataFrame({'ID' : np.random.randint(1,25,size=5),
'on/off' : np.random.randint(0,2,size=5),
'Time' : pd.date_range(start='01/01/2019',periods=5,freq='5s')})
print(df)
ID on/off Time
0 17 0 2019-01-01 00:00:00
1 21 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 1 2019-01-01 00:00:15
4 12 0 2019-01-01 00:00:20
the 0 and 1 in the on/off column correspond to when a machine is on or off (0 = on 1 = off)
currently, I use the following line of beautiful code to get the difference between my column as the data is rolling
df['Time Difference'] = (df.time - df.time.shift())
print(df)
ID on/off Time Time Difference
0 17 0 2019-01-01 00:00:00 NaT
1 21 0 2019-01-01 00:00:05 00:00:05
2 12 1 2019-01-01 00:00:10 00:00:05
3 12 1 2019-01-01 00:00:15 00:00:05
4 12 0 2019-01-01 00:00:20 00:00:05
now as this dataframe is quite verbose (each week I’ll receive about 150k rows)
what would be the best way to sum amount time a machine is off (where df[‘on/off] == 1) until the next 0 comes along? so in the above example for the 1st of January 2019 the machine of ID 12 didn’t run for 15 seconds until it resumed at 00:00:20
Here’s an approach that works for a simple example of one machine that varies between on and off during the course of one day. It works regardless of whether the machine is in on
or off
state in the first row.
df = pd.DataFrame({'ID': [12, 12, 12, 12, 12],
'on/off': [0,0,1,0,1],
'Time': ['2019-01-01 00:00:00', '2019-01-01 00:00:05', '2019-01-01 00:00:10','2019-01-01 00:00:15','2019-01-01 00:00:20']
})
ID on/off Time
0 12 0 2019-01-01 00:00:00
1 12 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 0 2019-01-01 00:00:15
4 12 1 2019-01-01 00:00:20
- First I made sure the
Time
column dtype is datetime64:
df['Time'] = pd.to_datetime(df['Time'])
- Then I get the indices of all rows where the state changed (either from
off
toon
, or fromon
tooff
:
s = df[df['on/off'].shift(1) != df['on/off']].index
df = df.loc[s]
- Then I create a column called
time shift
, which shows the timestamp of the most recent row where power state changed:
df['time shift'] = df['Time'].shift(1)
At this point the dataframe looks like this:
ID on/off Time time shift
0 12 0 2019-01-01 00:00:00 NaT
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
3 12 0 2019-01-01 00:00:15 2019-01-01 00:00:10
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
- Now, since we want to count the duration that the machine was off, I look at only the row indices where the state became
on
:
r = df[df['on/off'] == 1].index
df = df.loc[r]
At this point, the dataframe looks as it does below. Notice that the time shift
column is displaying the point at which the machine most recently turned off, prior to the time being displayed in Time
column, which is the timestamp when the machine turned back on. Finding the difference between these two columns will give us the length of each duration that the machine was off during the day:
ID on/off Time time shift
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
- The following line calculates total off-time, by summing the durations of each period that the machine was in its
off
state:
(df['Time'] - df['time shift']).sum()
Which outputs:
Timedelta('0 days 00:00:15')
Some additional context on how the Pandas .shift()
method works:
Shift takes all the rows in a column, and moves them either forward or back by a certain amount. .shift(1)
tells pandas to move the index of each row forward, or up, by 1. .shift(-1)
tells pandas to move the index of each row back, or down, by 1. Alternately put, .shift(1)
lets you look at the value of a column at the previous row index, and .shift(-1)
lets you look at the value of a column at the next row index, relative a given row in a column. It’s a handy way to compare a column’s values across different rows, without resorting to for-loops.