Condionally Filtering Dataframe using Pandas
Question:
I have a dataframe df
:
Store
Date
Question
Answer
1
44725
B
mango
1
44725
A
Yes
1
44725
E
Yes
1
44725
F
mango
1
44625
G
No
2
44625
A
Yes
2
44625
B
2
44625
C
Yes
2
44625
D
apple
2
44725
A
No
3
44725
A
Yes
3
44725
C
Yes
3
44725
D
strawberry
4
44726
C
Yes
4
44726
A
Yes
A,E,G,C are Questions and B,F,H,D are sub questions under Question
respectively.
For a store
for a Date
, If the Question is answered Yes
,I want to check if follow up /sub questions are present for their Questions, If not I want to get the store,Date,Question pairs as a pandas dataframe.
For example:
For store
2,Date
44625 Question
B is null/blank,need this in a separate dataframe like null_df
:
Store
Date
Question
2
44625
B
For store
3,Date
44725,Question
A does not have the sub question so want this in a separate dataframe df_check
:
Store
Date
Question
3
44725
A
4
44726
C
4
44726
A
For store
1 for Date
44725 All the questions have sub questions and answers.
How do I get the above 2 dataframes?
Answers:
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv', header=0)
aaa = {'A': 'B', 'E': 'F', 'G': 'H', 'C': 'D'}
null_df = pd.DataFrame()
df_check = pd.DataFrame()
def my_func(x):
ind = x[x['Answer'] == 'Yes'].index
if len(ind) > 0:
bbb = np.array([[aaa[x.loc[i, 'Question']], x.loc[i, 'Question']] for i in ind])
fff = x[x['Question'].isin(bbb[:, 0])]
nu = fff[pd.isnull(fff['Answer'])]
if len(nu) > 0:
global null_df
null_df = pd.concat([null_df, nu[['Store', 'Date', 'Question']]], ignore_index=True)
check_ind = np.in1d(bbb[:, 0], x['Question'])
check_ans = bbb[~ check_ind][:, 1]
check = x[x['Question'].isin(check_ans)]
if len(check) > 0:
global df_check
df_check = pd.concat([df_check, check[['Store', 'Date', 'Question']]], ignore_index=True)
df.groupby(['Store', 'Date']).apply(my_func)
print(null_df)
print(df_check)
Output
Store Date Question
0 2 44625 B
Store Date Question
0 3 44725 A
1 4 44726 C
2 4 44726 A
A dictionary aaa
is created in which the keys are questions
, and the values are sub-questions
. To find out which sub-question corresponds to the question.
The dataframe is grouped by the ‘Store’, ‘Date’ columns. Next, apply
with a custom function(my_func
).
Rows where x['Answer'] == 'Yes'
are selected to get indexes
. If there are indexes, then we fall under the condition.
The bbb
list is a nested list with sub-questions on the left
and questions on the right
that are selected by indexes where there are ‘Yes’ questions.
null_df:
The fff
list gets subquestions. With the isin function, which returns a boolean mask. In nu we get rows where Answer is NaN.
df_check:
check_ind
uses np.in1d to check if the selected sub-questions match the real ones. check_ind gets a boolean mask with True
sub-questions present. check_ans
get a list of questions without sub-questions(apply tilde ~
reversing the condition). check
we select lines with the necessary questions using isin
in which the list of questions is substituted.
pd.concat is used to connect both dataframes.
If something is not clear, you can print
the data of each line of code and see the result.
I have a dataframe df
:
Store | Date | Question | Answer |
---|---|---|---|
1 | 44725 | B | mango |
1 | 44725 | A | Yes |
1 | 44725 | E | Yes |
1 | 44725 | F | mango |
1 | 44625 | G | No |
2 | 44625 | A | Yes |
2 | 44625 | B | |
2 | 44625 | C | Yes |
2 | 44625 | D | apple |
2 | 44725 | A | No |
3 | 44725 | A | Yes |
3 | 44725 | C | Yes |
3 | 44725 | D | strawberry |
4 | 44726 | C | Yes |
4 | 44726 | A | Yes |
A,E,G,C are Questions and B,F,H,D are sub questions under Question
respectively.
For a store
for a Date
, If the Question is answered Yes
,I want to check if follow up /sub questions are present for their Questions, If not I want to get the store,Date,Question pairs as a pandas dataframe.
For example:
For store
2,Date
44625 Question
B is null/blank,need this in a separate dataframe like null_df
:
Store | Date | Question |
---|---|---|
2 | 44625 | B |
For store
3,Date
44725,Question
A does not have the sub question so want this in a separate dataframe df_check
:
Store | Date | Question |
---|---|---|
3 | 44725 | A |
4 | 44726 | C |
4 | 44726 | A |
For store
1 for Date
44725 All the questions have sub questions and answers.
How do I get the above 2 dataframes?
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv', header=0)
aaa = {'A': 'B', 'E': 'F', 'G': 'H', 'C': 'D'}
null_df = pd.DataFrame()
df_check = pd.DataFrame()
def my_func(x):
ind = x[x['Answer'] == 'Yes'].index
if len(ind) > 0:
bbb = np.array([[aaa[x.loc[i, 'Question']], x.loc[i, 'Question']] for i in ind])
fff = x[x['Question'].isin(bbb[:, 0])]
nu = fff[pd.isnull(fff['Answer'])]
if len(nu) > 0:
global null_df
null_df = pd.concat([null_df, nu[['Store', 'Date', 'Question']]], ignore_index=True)
check_ind = np.in1d(bbb[:, 0], x['Question'])
check_ans = bbb[~ check_ind][:, 1]
check = x[x['Question'].isin(check_ans)]
if len(check) > 0:
global df_check
df_check = pd.concat([df_check, check[['Store', 'Date', 'Question']]], ignore_index=True)
df.groupby(['Store', 'Date']).apply(my_func)
print(null_df)
print(df_check)
Output
Store Date Question
0 2 44625 B
Store Date Question
0 3 44725 A
1 4 44726 C
2 4 44726 A
A dictionary aaa
is created in which the keys are questions
, and the values are sub-questions
. To find out which sub-question corresponds to the question.
The dataframe is grouped by the ‘Store’, ‘Date’ columns. Next, apply
with a custom function(my_func
).
Rows where x['Answer'] == 'Yes'
are selected to get indexes
. If there are indexes, then we fall under the condition.
The bbb
list is a nested list with sub-questions on the left
and questions on the right
that are selected by indexes where there are ‘Yes’ questions.
null_df:
The fff
list gets subquestions. With the isin function, which returns a boolean mask. In nu we get rows where Answer is NaN.
df_check:
check_ind
uses np.in1d to check if the selected sub-questions match the real ones. check_ind gets a boolean mask with True
sub-questions present. check_ans
get a list of questions without sub-questions(apply tilde ~
reversing the condition). check
we select lines with the necessary questions using isin
in which the list of questions is substituted.
pd.concat is used to connect both dataframes.
If something is not clear, you can print
the data of each line of code and see the result.