Condionally Filtering Dataframe using Pandas

Question:

I have a dataframe df:

Store Date Question Answer
1 44725 B mango
1 44725 A Yes
1 44725 E Yes
1 44725 F mango
1 44625 G No
2 44625 A Yes
2 44625 B
2 44625 C Yes
2 44625 D apple
2 44725 A No
3 44725 A Yes
3 44725 C Yes
3 44725 D strawberry
4 44726 C Yes
4 44726 A Yes

A,E,G,C are Questions and B,F,H,D are sub questions under Question respectively.
For a store for a Date, If the Question is answered Yes,I want to check if follow up /sub questions are present for their Questions, If not I want to get the store,Date,Question pairs as a pandas dataframe.
For example:

For store 2,Date 44625 Question B is null/blank,need this in a separate dataframe like null_df:

Store Date Question
2 44625 B

For store 3,Date 44725,Question A does not have the sub question so want this in a separate dataframe df_check:

Store Date Question
3 44725 A
4 44726 C
4 44726 A

For store 1 for Date 44725 All the questions have sub questions and answers.
How do I get the above 2 dataframes?

Asked By: Scope

||

Answers:

import pandas as pd
import numpy as np

df = pd.read_csv('df.csv', header=0)

aaa = {'A': 'B', 'E': 'F', 'G': 'H', 'C': 'D'}


null_df = pd.DataFrame()
df_check = pd.DataFrame()


def my_func(x):
    ind = x[x['Answer'] == 'Yes'].index
    if len(ind) > 0:
        bbb = np.array([[aaa[x.loc[i, 'Question']], x.loc[i, 'Question']] for i in ind])

        fff = x[x['Question'].isin(bbb[:, 0])]
        nu = fff[pd.isnull(fff['Answer'])]
        if len(nu) > 0:
            global null_df
            null_df = pd.concat([null_df, nu[['Store', 'Date', 'Question']]], ignore_index=True)

        check_ind = np.in1d(bbb[:, 0], x['Question'])
        check_ans = bbb[~ check_ind][:, 1]
        check = x[x['Question'].isin(check_ans)]
        if len(check) > 0:
            global df_check
            df_check = pd.concat([df_check, check[['Store', 'Date', 'Question']]], ignore_index=True)


df.groupby(['Store', 'Date']).apply(my_func)

print(null_df)
print(df_check)

Output

   Store   Date Question
0      2  44625        B


   Store   Date Question
0      3  44725        A
1      4  44726        C
2      4  44726        A

A dictionary aaa is created in which the keys are questions, and the values are sub-questions. To find out which sub-question corresponds to the question.

The dataframe is grouped by the ‘Store’, ‘Date’ columns. Next, apply with a custom function(my_func).
Rows where x['Answer'] == 'Yes' are selected to get indexes. If there are indexes, then we fall under the condition.

The bbb list is a nested list with sub-questions on the left and questions on the right that are selected by indexes where there are ‘Yes’ questions.

null_df:
The fff list gets subquestions. With the isin function, which returns a boolean mask. In nu we get rows where Answer is NaN.

df_check:
check_ind uses np.in1d to check if the selected sub-questions match the real ones. check_ind gets a boolean mask with True sub-questions present. check_ans get a list of questions without sub-questions(apply tilde ~ reversing the condition). check we select lines with the necessary questions using isin in which the list of questions is substituted.

pd.concat is used to connect both dataframes.

If something is not clear, you can print the data of each line of code and see the result.

Answered By: inquirer
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.