Randomly select rows from a dataframe based on a column value

Question:

I have a data frame called df whose value counts are the following:

df.Priority.value_counts()

P3    39506
P2    3038 
P4    1138 
P1    1117 
P5    252  
Name: Priority, dtype: int64

I am trying to create a balanced dataset called df_balanced from df by restricting the number of entries in the P3 category to 5000. The expected output should look like this:

P3    5000
P2    3038 
P4    1138 
P1    1117 
P5    252  
Name: Priority, dtype: int64

I tried the following code:

s0 = df.Priority[df.Priority.eq('P3')].sample(5000).index

df_balanced = df.loc[s0.union(df)].reset_index(drop=True, inplace=True)  # I am unsure how to exclude the entries of `P3` categories from `df`!

I used this as a reference: "Randomly selecting rows from a dataframe based on a column value", but the solution provided there isn't optimal for more than 2 categories.

Asked By: Joe


Answers:

A possible solution:

import random

# maximum number of P1 rows to keep; the rows kept are
# chosen at random
maxlim_catP1 = 4

df.groupby('X').apply(
    lambda g: g.loc[random.sample(g.index.to_list(), min(maxlim_catP1, len(g))), :] if
    (g.loc[g.index[0], 'X'] == 'P1') else g)

Output:

       X  Y
X          
P1 2  P1  c
   3  P1  d
   0  P1  a
   1  P1  b
P2 4  P2  e
   6  P2  g
   7  P2  h

Data:

    X  Y
0  P1  a
1  P1  b
2  P1  c
3  P1  d
4  P2  e
5  P1  f
6  P2  g
7  P2  h
8  P1  i

Answered By: PaulS
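
The same idea can be sketched with the question's own column name: sample the over-represented category on its own, then concatenate it with the rows of all other categories. The helper name cap_category, its parameters, and the small stand-in dataframe are assumptions made for illustration; they do not come from the question or the answer above.

```python
import pandas as pd


def cap_category(df, column, category, max_rows, seed=None):
    """Return a copy of df in which `category` has at most `max_rows` rows.

    Hypothetical helper, not part of pandas.
    """
    is_target = df[column].eq(category)
    target = df[is_target]
    # Sample only when the category exceeds the cap; otherwise keep it whole.
    if len(target) > max_rows:
        target = target.sample(n=max_rows, random_state=seed)
    # Recombine with the untouched categories, keep the original row order,
    # and renumber the index from 0.
    combined = pd.concat([target, df[~is_target]])
    return combined.sort_index().reset_index(drop=True)


# Small stand-in for the question's data (the real df is not available here).
df = pd.DataFrame({"Priority": ["P3"] * 10 + ["P2"] * 3 + ["P1"] * 2})
df_balanced = cap_category(df, "Priority", "P3", max_rows=4, seed=0)
print(df_balanced["Priority"].value_counts())
```

With the question's data, the corresponding call would presumably be cap_category(df, "Priority", "P3", max_rows=5000). Note that reset_index is used without inplace=True here, because with inplace=True it returns None, so assigning its result would not produce a dataframe.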