How to randomly sample and keep only n values of repeating IDs?

Question

I have a data frame that looks like this:

user_id	tweet_id	tweet
user123	7658j	dogs are super
user245	66721	yes dogs are super
user245	6d343	yes cats are also super
<…>	<…>	<…>
user245	541238	well I developed allergy on cates

When I check the value counts for each user, I get the following:

id	count
user245	456
user123	115
user427	2

I want to subset the data so that I keep all rows for ids whose value count is 100 or below, and keep only 100 randomly sampled rows for ids whose value count is above 100. How can I do that?

Asked By: Stuck


Answer 1

You can try:

(df.groupby('user_id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 100)))
)
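As a hedged alternative to the apply-based snippet above (recent pandas releases, 2.2 and later, emit a DeprecationWarning when apply operates on the grouping column), the same rule can be sketched by splitting the frame into small and large groups and calling GroupBy.sample only on the large ones. The helper name cap_per_group, the seed argument, and the synthetic data (built only to mimic the counts in the question) are assumptions for illustration, not part of the original answer.

```python
import pandas as pd


def cap_per_group(df, key, n, seed=None):
    """Keep all rows of groups with at most n rows; sample n rows from larger groups."""
    # Size of the group each row belongs to, aligned with df's index
    sizes = df[key].map(df[key].value_counts())
    small = df[sizes <= n]
    # GroupBy.sample raises if a group has fewer than n rows,
    # so it is applied only to the groups that are large enough
    large = df[sizes > n].groupby(key).sample(n=n, random_state=seed)
    # Restore the original row order
    return pd.concat([small, large]).sort_index()


# Synthetic data with the same per-user counts as in the question (assumption)
df = pd.DataFrame(
    {"user_id": ["user245"] * 456 + ["user123"] * 115 + ["user427"] * 2}
)
df["tweet_id"] = range(len(df))

subset = cap_per_group(df, "user_id", n=100, seed=0)
print(subset["user_id"].value_counts())
```

With these counts, user245 and user123 are each cut down to 100 rows and user427 keeps both of its rows. Passing a fixed seed makes the sample repeatable; leaving it as None gives a different sample on each run.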

Example (with n=3):

import pandas as pd

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
# Sample up to 3 rows per id; groups with fewer than 3 rows are kept whole
(df.groupby('id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 3)))
)

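Output: 9 rows in total (3 for A, 2 for B, 1 for C, 3 for D); which rows are picked changes from run to run because the sample is random.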
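If apply is not needed, a similar result can be sketched by shuffling the whole frame once and then taking the first n rows of each group with GroupBy.head. This is a different technique from the one in the answer, shown here on the same example data; the random_state value is an arbitrary choice for repeatability.

```python
import pandas as pd

df = pd.DataFrame({"id": list("AAAAAABBCDDDD"), "col": range(13)})
n = 3

# Shuffle all rows, then keep at most n rows per id.
# head(n) keeps whole groups that have fewer than n rows.
shuffled = df.sample(frac=1, random_state=0)
out = shuffled.groupby("id").head(n).sort_index()

# Row count per id after capping
print(out.groupby("id").size())
```

The per-id counts come out as A: 3, B: 2, C: 1, D: 3, regardless of the seed; only the specific rows chosen for A and D depend on the shuffle.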

Answered By: mozway
