Pandas groupby sample, keeping groups with fewer than n rows as they are

Question:

I have a dataset on which I want to sample after a groupby. In general this can be achieved with df.groupby("some_id").sample(n=100). The problem is that some groups have fewer than 100 rows (and yes, replace=True is an option, but what if we want to keep the smaller sample as-is: if a group has more than 100 rows I want a sample of size 100, if fewer, leave it as it is). I couldn't find a single example of achieving something similar, and any ideas are appreciated.
For now the only idea I have is to forget about groupby and build, let's say, a list of groups, something like:

def weird_sampling(group):
    # groups with fewer than 100 rows are kept whole
    if group.shape[0] > 99:
        return group.sample(100)
    return group

groups_list = []
for i in df.some_id.unique():
    groups_list.append(weird_sampling(df[df.some_id == i]))

df_sampled = pd.concat(groups_list)

But it seems extremely inefficient.

Asked By: Igor sharm

Answers:

After some more trials with this problem I came up with this idea. It still may not be the best or most efficient solution, but it is already much better and does the job:

df = df.groupby("some_id").apply(lambda x: x.sample(n=100) if x.shape[0] > 99 else x)
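
Depending on your pandas version, apply here may prepend some_id to the result's index; passing group_keys=False to groupby (or chaining reset_index) keeps a flat index. A minimal sanity check on a toy frame (the column name some_id and the cap of 3 are illustrative only):

import pandas as pd

df = pd.DataFrame({"some_id": ["a"] * 5 + ["b"] * 2, "value": range(7)})
out = df.groupby("some_id", group_keys=False).apply(
    lambda x: x.sample(n=3) if x.shape[0] > 2 else x
)
print(out.groupby("some_id").size())  # "a" is capped at 3, "b" keeps its 2 rows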
Answered By: Igor sharm

I think the cleanest answer might be to shuffle your data and then select up to n rows from each group:

# maximum number of elements in group
n = 100

# sample(frac=1) --> randomise the order
# groupby("some_id").head(n) --> select up to n
df.sample(frac=1).groupby("some_id").head(n)
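
This works because sample(frac=1) randomises the row order first, so head(n) then takes a random n rows from each large group and returns small groups whole. A quick sanity check, reusing the toy frame and the cap of 3 from the answer above:

capped = df.sample(frac=1).groupby("some_id").head(3)
assert (capped.groupby("some_id").size() <= 3).all()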
Answered By: Matt

Very similar to the answer by Igor above, but less hardcoded and using random_state and reset_index. I would advise using random_state if you want your results to be reproducible.

def sample(df, sample_size, seed):
    # take at most sample_size rows; smaller groups are returned whole
    return df.sample(n=min(len(df), sample_size), random_state=seed)

df_sample = df.groupby(['some_id']).apply(sample, sample_size=100, seed=67871215).reset_index(drop=True)
df_sample
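
The min(len(df), sample_size) guard is what makes this work: plain sample(n=...) raises a ValueError when n exceeds the group's row count. A quick illustration on the toy frame from the first answer (sample_size and seed are arbitrary):

df_small = sample(df[df.some_id == "b"], sample_size=3, seed=0)
len(df_small)  # 2: the whole group is returned, since it has fewer than 3 rows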
Answered By: camel_case