How to divide a dataframe based on categorical variables?
Question:
I have a dataset in which some people's credit card applications were accepted while others were declined.
I want to divide the dataset into two datasets: one in which all the credit cards are accepted (card='yes') and another in which all the credit cards are declined (card='no').
The dataset is as shown below:
How can I do that?
Answers:
This should work:
df1 = credit5[credit5['card'] == 'yes']  # subset of the df where every 'card' entry is 'yes'
df2 = credit5[credit5['card'] == 'no']   # subset of the df where every 'card' entry is 'no'
One option is to perform a groupby operation inside a dict comprehension. This has the added benefit of working for an arbitrary number of categories.
dfs_by_card = {
    accepted: sub_df
    for accepted, sub_df in credit5.groupby("card")
}
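As a usage illustration, here is a minimal, self-contained sketch of the groupby approach. The sample rows and the names accepted_df and declined_df are made up for the example; only the "card" column and its 'yes'/'no' values come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the "card" column mirrors the question.
credit5 = pd.DataFrame({
    "card": ["yes", "no", "yes", "no", "yes"],
    "age": [36, 38, 35, 37, 30],
})

# Build one sub-dataframe per distinct value found in the "card" column.
dfs_by_card = {
    accepted: sub_df
    for accepted, sub_df in credit5.groupby("card")
}

# Each category is then available by its label.
accepted_df = dfs_by_card["yes"]
declined_df = dfs_by_card["no"]

print(sorted(dfs_by_card))
print(len(accepted_df), len(declined_df))
```

Each sub-dataframe keeps the row index of the original frame, so rows can still be traced back to credit5.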
Here is another solution, not much different from @Derek Eden's solution.
import pandas as pd

# Create a sample dataframe
credit5 = pd.DataFrame({
    'Card': ['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Age': [36, 35, 38, 38, 37, 37, 30, 30, 30, 33],
    'Income': [4.520, 2.420, 4.500, 2.540, 9.788, 5.268, 6.879, 7.852, 5.562, 4.789],
})
First five rows of the dataframe:
  Card  Age  Income
0  Yes   36   4.520
1  Yes   35   2.420
2  Yes   38   4.500
3   No   38   2.540
4   No   37   9.788
credit_no = credit5[credit5['Card'] == 'No']
Output (rows where Card is 'No'):
  Card  Age  Income
3   No   38   2.540
4   No   37   9.788
7   No   30   7.852
8   No   30   5.562
9   No   33   4.789
credit_yes = credit5[credit5['Card'] == 'Yes']
Output (rows where Card is 'Yes'):
  Card  Age  Income
0  Yes   36   4.520
1  Yes   35   2.420
2  Yes   38   4.500
5  Yes   37   5.268
6  Yes   30   6.879
Let me know if this helps.
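Adding on to @Vishwas's answer, you can get a minor speed boost by computing the boolean mask once and reversing it, instead of running a second comparison. The ~ operator has to be applied to the mask, not to the filtered dataframe (negating a dataframe that contains strings raises a TypeError). Note that the reversed mask selects every row that is not 'No', which equals the 'Yes' rows only if the column holds just those two values.
mask = credit5['Card'] == 'No'
credit_no = credit5[mask]
credit_yes = credit5[~mask]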
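As a sanity check on the mask-based split, the sketch below builds the mask once, negates the mask itself (the ~ operator is meant for the boolean Series, not for an already filtered dataframe), and confirms that the two pieces together reproduce the original rows. The sample data and the names is_no and recombined are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the 'Card' column from the answer above.
credit5 = pd.DataFrame({
    "Card": ["Yes", "Yes", "No", "No", "Yes"],
    "Age": [36, 35, 38, 37, 30],
})

# Compute the comparison once and reuse it for both subsets.
is_no = credit5["Card"] == "No"
credit_no = credit5[is_no]
credit_yes = credit5[~is_no]  # every row where Card is not 'No'

# Stitch the two subsets back together and restore the original row order.
recombined = pd.concat([credit_no, credit_yes]).sort_index()
print(recombined.equals(credit5))
print(len(credit_no), len(credit_yes))
```

If the column could contain missing or unexpected values, those rows would land in credit_yes here, so an explicit comparison against 'Yes' is the safer choice in that case.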