Solving a problem without for loops, using vectorization (pandas dataframe) with multiple conditions

Question:

I have a dataframe like the one below:

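import pandas as pd

id_ = ['A','A','A','A','A','B','B','B']
workcycle = [0,4,5,100,140,0,5,20]
date = ['2022-01-01','2022-01-01','2022-01-02','2022-02-04','2022-03-10','2022-01-01','2022-01-02','2022-02-04']
failure_type=['A','B',None,'B',None,'A',None,None]
repair_type=[None,None,'Repair_Type_1',None,'Repair_Type_2',None,'Repair_Type_1','Repair_Type_2']
event=['failure','failure','Repair','failure','Repair','failure','Repair','Repair']
# Column names are assumed to match the list names above.
df = pd.DataFrame({'id': id_, 'workcycle': workcycle, 'date': date,
                   'failure_type': failure_type, 'repair_type': repair_type, 'event': event})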

[![enter image description here](https://i.stack.imgur.com/4oDGC.png)](https://i.stack.imgur.com/4oDGC.png)

This dataframe records failures and repairs of several machines, and there are multiple failure types.

There are two criteria in parallel: workcycle and date.

For now, let's consider only workcycle.

I want to know which repair activity is the main contributor to resolving each failure.

My hypothesis:

Among failures that happened less than 30 workcycles before a repair, if no failure of the same type occurs in the 30 workcycles after that repair, we can say that this repair activity is the main contributor to resolving the failure.

For example, the repair at workcycle 5 for machine id = A is the main contributor to resolving the failure at workcycle 0 for id = A.

However, I have one condition:

If there are more than two repair activities after a failure and the workcycle difference between two repairs is less than 30, that case is excluded.

I want to have a result like below:

[![enter image description here](https://i.stack.imgur.com/WevxC.png)](https://i.stack.imgur.com/WevxC.png)

I know how to do it with for loops, but I end up using too many of them.

How can I vectorize this using just np.where, .loc, or similar?

Asked By: stat_man

||

Answers:

import numpy as np
import pandas as pd

id_ = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
workcycle = [0, 5, 100, 140, 0, 5, 20, 10]
date = ['2022-01-01', '2022-01-02', '2022-02-04', '2022-03-10', '2022-01-01', '2022-01-02', '2022-02-04', '2022-02-04']
failure_type = ['A', None, 'B', None, 'A', None, None, None]
event = ['failure', 'Repair', 'failure', 'Repair', 'failure', 'Repair', 'Repair', 'Repair']
df = pd.DataFrame({'id': id_, 'workcycle': workcycle, 'date': date, 'event': event, 'failure_type': failure_type})

df['is_resolved'] = 'Not resolved'
df['rep'] = True
df.loc[df[df['event'] == 'failure'].index, 'cs'] = 1
df['cs'] = df['cs'].fillna(0).cumsum()

One more row (a third repair for machine B) was added to the data to trigger the condition: more than two repairs.

A column 'is_resolved' is created with the default value 'Not resolved'. The 'rep' column is set to True; it is used later to exclude blocks with more than two repairs. To group the rows block by block, 'cs' is created: each 'failure' row is set to 1, all other rows to 0, and the cumulative sum is taken, so each failure and the repairs that follow it share one 'cs' value.
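The cumulative-sum grouping described above can be seen in isolation with a small sketch. The event values are the ones from this answer's data; the name `event` for the standalone Series is just for illustration. Unlike the code above, this version yields integers rather than floats because it skips the fillna step.

```python
import pandas as pd

# Event column in the same order as the answer's data.
event = pd.Series(['failure', 'Repair', 'failure', 'Repair',
                   'failure', 'Repair', 'Repair', 'Repair'])

# True on each failure row; the running total of those True values
# gives every failure and the repairs after it the same block id.
cs = (event == 'failure').cumsum()
print(cs.tolist())  # [1, 1, 2, 2, 3, 3, 3, 3]
```

This relies on the rows being ordered so that each failure is followed by its repairs, as in the sample data.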

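def repair_func(x):
    # x holds the 'Repair' rows of one 'cs' block.
    # True only if the block has more than two repairs and every consecutive workcycle gap is below 30.
    aaa = np.where(len(x) > 2, (np.abs(df.loc[x.index, 'workcycle'].diff().values[1:]) < 30).all(), False)
    if aaa:
        df.loc[x.index[0] - 1, 'rep'] = False  # the row just before the first repair
        df.loc[x.index, 'rep'] = False


df[df['event'] == 'Repair'].groupby('cs')['event'].apply(repair_func)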

Here, the rows with 'Repair' in the 'event' column are selected and grouped by 'cs'; any column can be selected, in this case 'event'. np.where checks whether the group has more than two 'Repair' rows. If so, the differences between consecutive 'workcycle' values are computed and checked against the condition < 30. If all of them satisfy it, 'rep' is set to False for the entire group and for the row just before it.
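Since the question asks to avoid loops, the same flag can probably be computed without a Python function per group. The following is only a sketch under the same assumptions as the code above (rows ordered so that each failure precedes its repairs); the helper names `repairs`, `gap`, `n_repairs`, `all_close` and `flagged_cs` are made up for this example. It marks every row of a flagged 'cs' block, which on this data matches what repair_func sets.

```python
import pandas as pd

df = pd.DataFrame({
    'workcycle': [0, 5, 100, 140, 0, 5, 20, 10],
    'event': ['failure', 'Repair', 'failure', 'Repair',
              'failure', 'Repair', 'Repair', 'Repair'],
})
df['cs'] = (df['event'] == 'failure').cumsum()

repairs = df[df['event'] == 'Repair']

# Absolute workcycle gap to the previous repair in the same block
# (NaN for the first repair of each block).
gap = repairs.groupby('cs')['workcycle'].diff().abs()

# Number of repairs in each block, broadcast back to the repair rows.
n_repairs = repairs.groupby('cs')['workcycle'].transform('size')

# True for a block when every gap is below 30 (the NaN of the first repair is ignored).
all_close = gap.fillna(0).lt(30).groupby(repairs['cs']).transform('all')

# Blocks with more than two repairs that are all close together.
flagged_cs = repairs.loc[(n_repairs > 2) & all_close, 'cs'].unique()
df['rep'] = ~df['cs'].isin(flagged_cs)
print(df)
```

Working on block ids instead of `x.index[0] - 1` avoids relying on the failure sitting exactly one row above the first repair.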

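def failure_func(x):
    # x holds the 'failure_type' value of one failure row.
    f = x.index[0] + 1  # index of the row right after the failure
    aaa = df.loc[f:]    # all rows after the failure
    # Is there a later row with the same failure_type and workcycle < 30?
    bbb = np.where(f <= len(df), (aaa.loc[df.loc[f:, 'failure_type'] == x.values[0], 'workcycle'] < 30).any(), False)
    if bbb:
        df.loc[x.index, 'is_resolved'] = 'resolved'

df[(df['event'] == 'failure') & (df['rep'] == True)].groupby('cs')['failure_type'].apply(failure_func)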

failure = df[df['event'] == 'failure'][['id', 'date', 'failure_type', 'is_resolved']].copy().reset_index(drop=True)

Here, the 'failure' rows of the 'event' column are selected, excluding those with 'rep' == False, and grouped by 'cs' with the 'failure_type' column selected. 'f' is the index right after the selected row, and aaa is the slice of the dataframe from there on. np.where checks that the start of the slice is within the index range: f <= len(df). Then rows with the same 'failure_type' and 'workcycle' < 30 are searched for. If there is at least one, 'is_resolved' is set to 'resolved'.

print(failure)

Output

  id        date failure_type   is_resolved
0  A  2022-01-01            A      resolved
1  A  2022-02-04            B  Not resolved
2  B  2022-01-01            A  Not resolved
Answered By: inquirer