How to randomly sample and keep only n values of repeating IDs?

Question:

I have a data frame that looks like this:

user_id tweet_id tweet
user123 7658j dogs are super
user245 66721 yes dogs are super
user245 6d343 yes cats are also super
<…> <…> <…>
user245 541238 well I developed allergy on cates

When I check the value counts for each user, I get the following:

id count
user245 456
user123 115
user427 2

I want to subset the data so that I keep all rows for IDs whose count is 100 or fewer, and keep only 100 randomly sampled rows for IDs whose count is above 100. How can I do this?

Asked By: Stuck


Answers:

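You can group by user_id and sample min(group size, 100) rows from each group, so that small groups are kept whole and large groups are cut down to 100 random rows: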

(df.groupby('user_id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 100)))
)

Example (with n=3):

import pandas as pd

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
(df.groupby('id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 3)))
)

Output (rows are sampled at random, so yours will differ):

   id  col
0   A    0
4   A    4
3   A    3
7   B    7
6   B    6
8   C    8
12  D   12
11  D   11
9   D    9
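
A possible alternative, sketched here under assumptions rather than taken from the answer itself: shuffle the whole frame once and then keep the first n rows of each group with head. Because the shuffle is uniform, the first n rows of each group form a random sample of that group, and groups with fewer than n rows are kept whole. The helper name cap_rows_per_group and the seed parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd


def cap_rows_per_group(df, key, n, seed=None):
    """Keep at most n randomly chosen rows per value of `key`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Shuffle every row once; passing a seed makes the result reproducible.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) keeps the first n rows of each group in the shuffled order,
    # or the whole group when it has fewer than n rows.
    capped = shuffled.groupby(key).head(n)
    # Restore the original row order.
    return capped.sort_index()


df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
result = cap_rows_per_group(df, 'id', 3, seed=0)
print(result['id'].value_counts().sort_index())
```

For the data in the question, the call would be cap_rows_per_group(df, 'user_id', 100). This avoids calling a Python lambda once per group, which may matter when there are many users.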
Answered By: mozway