Filter groupby by combos with either category types

Question:

I have a dataset that looks something like this:

df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/3/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/1/22 12:00:00AM', 'Beta'], 
[1,'A','X','1/2/22 1:00:00AM', 'Gamma'], 
[1,'B','Y','1/1/22 1:00:00AM', 'Alpha'],
[2,'A','Z','1/2/22 12:00:00AM', 'Alpha'],
[2,'A','Z','1/1/22 12:00:00AM', 'Alpha'], 
[2,'A','Z','1/1/22 12:00:00AM', 'Beta']],
columns=['ID', 'Category', 'Site', 'Task Completed', 'Type'])

ID Category Site Task Completed Type
1 A X 1/2/22 12:00:00AM Alpha
1 A X 1/3/22 12:00:00AM Alpha
1 A X 1/1/22 12:00:00AM Beta
1 A X 1/2/22 1:00:00AM Gamma
1 B Y 1/1/22 1:00:00AM Alpha
2 A Z 1/2/22 12:00:00AM Alpha
2 A Z 1/1/22 12:00:00AM Alpha
2 A Z 1/1/22 12:00:00AM Beta

I want to find the difference between the latest and earliest Task Completed dates (max minus min) of the ‘Alpha’ rows for every ID/Category/Site combo. For a combo to be counted, it must also have at least one row of a type other than ‘Alpha’. I also want to count the ‘Alpha’ rows in each ID/Category/Site combo.

So, for this dataset, my intended result would look like this:

ID Category Site Time Difference # of instances
1 A X 1 2
2 A Z 1 2

I know how to get the counts and time difference if ‘Type’ is not considered:

# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)


x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    **{
        "Time Difference": (
            "Task Completed",
            lambda x: (x.max() - x.min()).days,
        ),
        "# of instances": ("Task Completed", "count"),
    }
)

print(x)

Which prints

   ID Category Site  Time Difference  # of instances
0   1        A    X                2               4
1   1        B    Y                0               1
2   2        A    Z                1               2

But I can’t figure out how to consider ‘type’ as well.

Asked By: CowboyCoder


Answers:

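import numpy as np
import pandas as pd

# Assumes df is the question's dataframe with "Task Completed" already converted to datetime.
aaa = []  # collects one result row per qualifying group

def my_func(x):
    ind = x['Type'] == 'Alpha'        # mask of the 'Alpha' rows in this group
    alf = x.loc[ind, 'Type'].count()  # number of 'Alpha' rows
    # keep the group only if it has at least two 'Alpha' rows and at least one other type
    if alf >= 2 and len(x[x['Type'] != 'Alpha']) != 0:
        difference = (x.loc[ind, 'Task Completed'].max() - x.loc[ind, 'Task Completed'].min()).days
        bbb = x[ind][['ID', 'Category', 'Site']].values[0]
        bbb = np.insert(bbb, 3, [difference, alf])  # place the two computed values after the three keys
        aaa.append(bbb)


# apply is called only for its side effect of filling aaa
df.groupby(["ID", "Category", "Site"], as_index=False).apply(my_func)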

df1 = pd.DataFrame(aaa, columns=['ID', 'Category', 'Site', 'Time Difference', '# of instances'])

print(df1)

Output

   ID Category Site  Time Difference  # of instances
0   1        A    X                1               2
1   2        A    Z                1               2

I wrote a function and passed it to apply on the grouped data. The variable alf counts the rows in the group whose Type is ‘Alpha’. If a group has two or more ‘Alpha’ rows and at least one row that is not ‘Alpha’, the time difference is computed from the ‘Alpha’ rows only. Each result is appended to the aaa list, which is then used to build a new dataframe.
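For comparison, here is a minimal sketch of the same conditions written without the global list, using a boolean mask plus a named aggregation. It is an alternative, not the answer's own method. The explicit datetime format string and the helper column name _other are assumptions made for this sample data.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, "A", "X", "1/2/22 12:00:00AM", "Alpha"],
        [1, "A", "X", "1/3/22 12:00:00AM", "Alpha"],
        [1, "A", "X", "1/1/22 12:00:00AM", "Beta"],
        [1, "A", "X", "1/2/22 1:00:00AM", "Gamma"],
        [1, "B", "Y", "1/1/22 1:00:00AM", "Alpha"],
        [2, "A", "Z", "1/2/22 12:00:00AM", "Alpha"],
        [2, "A", "Z", "1/1/22 12:00:00AM", "Alpha"],
        [2, "A", "Z", "1/1/22 12:00:00AM", "Beta"],
    ],
    columns=["ID", "Category", "Site", "Task Completed", "Type"],
)
# Assumed format for the sample strings (month first, 12-hour clock).
df["Task Completed"] = pd.to_datetime(
    df["Task Completed"], format="%m/%d/%y %I:%M:%S%p"
)

keys = ["ID", "Category", "Site"]
is_alpha = df["Type"].eq("Alpha")

# True on every row of a group that contains at least one non-Alpha row.
has_other = df.assign(_other=~is_alpha).groupby(keys)["_other"].transform("any")

# Aggregate only the Alpha rows of the groups that passed the check.
result = (
    df[is_alpha & has_other]
    .groupby(keys, as_index=False)
    .agg(
        **{
            "Time Difference": (
                "Task Completed",
                lambda s: (s.max() - s.min()).days,
            ),
            "# of instances": ("Task Completed", "count"),
        }
    )
)

# Same threshold as the answer: at least two Alpha rows per group.
result = result[result["# of instances"] >= 2].reset_index(drop=True)
print(result)
```

On the sample data this keeps the (1, A, X) and (2, A, Z) combos, matching the output shown above. Filtering rows before grouping avoids relying on side effects inside apply.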

Update 08.12.2022
The dataframe has changed and now includes a ‘Case’ column:

   ID Category Site Case     Task Completed   Type
0   1        A    X  AAA  1/2/22 12:00:00AM  Alpha
1   1        A    X  AAA  1/3/22 12:00:00AM  Alpha
2   1        A    X  AAA  1/1/22 12:00:00AM   Beta
3   1        A    X  BBB   1/2/22 1:00:00AM  Gamma
4   1        A    X  AAA  1/4/22 12:00:00AM  Alpha
5   1        B    Y  BBB   1/1/22 1:00:00AM  Alpha
6   2        A    Z  FFF  1/2/22 12:00:00AM  Alpha
7   2        A    Z  FFF  1/1/22 12:00:00AM  Alpha
8   2        A    Z  FFF  1/1/22 12:00:00AM   Beta

Output

   ID Category Site Case  Time Difference  # of instances
0   1        A    X  AAA                2               3
1   2        A    Z  FFF                1               2

Code:

import numpy as np
import pandas as pd

df = pd.read_csv('df.csv', header=0)

df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)

aaa = []

def my_func(x):
    ind = x['Type'] == 'Alpha'
    alf = x.loc[ind, 'Type'].count()
    if alf >= 2 and len(x[x['Type'] != 'Alpha']) != 0:
        difference = (x.loc[ind, 'Task Completed'].max() - x.loc[ind, 'Task Completed'].min()).days
        bbb = x[ind][['ID', 'Category', 'Site', 'Case']].values[0]
        bbb = np.insert(bbb, 4, [difference, alf])
        aaa.append(bbb)



df.groupby(['ID', 'Category', 'Site', 'Case'], as_index=False).apply(my_func)

df1 = pd.DataFrame(aaa, columns=['ID', 'Category', 'Site', 'Case', 'Time Difference', '# of instances'])

print(df1)

What changed:

1. Added the ‘Case’ column to the groupby keys.

2. Added ‘Case’ to the columns selected into the bbb variable.

3. Changed the insert position in np.insert(bbb, 4, [difference, alf]) from 3 to 4, because there are now four key values before the computed ones.

4. Added the ‘Case’ column when creating df1.
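Because each of these changes only follows from the list of grouping columns, that list can be made a parameter. The sketch below is one way to do that, under assumptions: summarize_alpha, its argument names, the helper column _other, and the explicit datetime format are all hypothetical and not part of the original answer.

```python
import pandas as pd


def summarize_alpha(df, keys, type_col="Type", date_col="Task Completed"):
    """Hypothetical helper: day span and count of 'Alpha' rows per group.

    A group is kept only if it has at least two 'Alpha' rows and at least
    one row of another type, mirroring the conditions in the answer.
    """
    is_alpha = df[type_col].eq("Alpha")
    # True on every row of a group that contains at least one non-Alpha row.
    has_other = df.assign(_other=~is_alpha).groupby(keys)["_other"].transform("any")
    out = (
        df[is_alpha & has_other]
        .groupby(keys, as_index=False)
        .agg(
            **{
                "Time Difference": (date_col, lambda s: (s.max() - s.min()).days),
                "# of instances": (date_col, "count"),
            }
        )
    )
    return out[out["# of instances"] >= 2].reset_index(drop=True)


# The updated sample data, including the 'Case' column.
df = pd.DataFrame(
    [
        [1, "A", "X", "AAA", "1/2/22 12:00:00AM", "Alpha"],
        [1, "A", "X", "AAA", "1/3/22 12:00:00AM", "Alpha"],
        [1, "A", "X", "AAA", "1/1/22 12:00:00AM", "Beta"],
        [1, "A", "X", "BBB", "1/2/22 1:00:00AM", "Gamma"],
        [1, "A", "X", "AAA", "1/4/22 12:00:00AM", "Alpha"],
        [1, "B", "Y", "BBB", "1/1/22 1:00:00AM", "Alpha"],
        [2, "A", "Z", "FFF", "1/2/22 12:00:00AM", "Alpha"],
        [2, "A", "Z", "FFF", "1/1/22 12:00:00AM", "Alpha"],
        [2, "A", "Z", "FFF", "1/1/22 12:00:00AM", "Beta"],
    ],
    columns=["ID", "Category", "Site", "Case", "Task Completed", "Type"],
)
# Assumed format for the sample strings (month first, 12-hour clock).
df["Task Completed"] = pd.to_datetime(
    df["Task Completed"], format="%m/%d/%y %I:%M:%S%p"
)

with_case = summarize_alpha(df, ["ID", "Category", "Site", "Case"])
print(with_case)
```

With this shape, adding or removing a grouping column means editing only the keys list passed in, with no insert positions to keep in sync.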

Answered By: inquirer