How to randomly split a grouped dataframe in Python

Question:

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({
               "player_id":[1,1,2,2,3,3,4,4,5,5,6,6],
               "year"     :[1,2,1,2,1,2,1,2,1,2,1,2],
               "overall"  :[20,16,7,3,8,80,20,12,9,3,2,1]})

What is the easiest way to put the players in random order while keeping each player_id's rows together, e.g.

player_id year overall
4 1 20
4 2 12
1 1 20
1 2 16

Then I want to split it 80-20 into training and test sets that don’t share any player_id.

Asked By: Diego

Answers:

As Quang Hoang suggested in the comments, you can split the unique ids first and then select the rows based on those ids.

# Sample 20% of the unique player ids for the test set (the result is random).
test_ids = df.player_id.drop_duplicates().sample(frac=0.2).values
#-> e.g. array([2])

train_data = df[~df["player_id"].isin(test_ids)]
"""
    player_id  year  overall
0           1     1       20
1           1     2       16
4           3     1        8
5           3     2       80
6           4     1       20
7           4     2       12
8           5     1        9
9           5     2        3
10          6     1        2
11          6     2        1
"""

test_data = df[df["player_id"].isin(test_ids)]
"""
   player_id  year  overall
2          2     1        7
3          2     2        3
"""
Answered By: zaki98
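
The same idea can be wrapped in a small helper that also covers the random group ordering asked about in the question. This is only a sketch: the function name `shuffle_and_split_by_group` and its `test_frac` and `seed` parameters are made up for illustration and are not part of pandas. It assumes the 80-20 split is taken over players rather than over rows.

```python
import numpy as np
import pandas as pd


def shuffle_and_split_by_group(df, group_col, test_frac=0.2, seed=None):
    """Hypothetical helper: shuffle whole groups, then split them by group id."""
    rng = np.random.default_rng(seed)

    # Unique group ids in random order.
    ids = rng.permutation(df[group_col].unique())

    # Number of groups assigned to the test set (assumption: at least one).
    n_test = max(1, int(round(len(ids) * test_frac)))
    test_ids, train_ids = ids[:n_test], ids[n_test:]

    # Rank every id by its position in the shuffled order, then stable-sort
    # the rows by that rank so each group's rows stay together and keep
    # their original relative order.
    rank = pd.Series(np.arange(len(ids)), index=ids)
    order = np.argsort(df[group_col].map(rank).to_numpy(), kind="stable")
    shuffled = df.iloc[order]

    train = shuffled[shuffled[group_col].isin(train_ids)]
    test = shuffled[shuffled[group_col].isin(test_ids)]
    return train, test


df = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "year":      [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "overall":   [20, 16, 7, 3, 8, 80, 20, 12, 9, 3, 2, 1],
})

train, test = shuffle_and_split_by_group(df, "player_id", test_frac=0.2, seed=0)

# No player appears in both sets.
assert set(train["player_id"]).isdisjoint(test["player_id"])
```

With six players and test_frac=0.2, one player ends up in the test set and the other five in the training set. Passing a seed makes the split repeatable; leave it as None for a different split on each run.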