How can I sample equally from a dataframe?

Question:

Suppose I have some observations, each labelled with a class from 1 to n. The classes do not necessarily occur equally often in the data set.

How can I sample equally from each class in the dataframe? Right now I do something like this:

frames = []
classes = df.classes.unique()

for i in classes:
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)

equally_sampled = pd.concat(frames)

Is there a pandas function to equally sample?

Asked By: Demetri Pananos


Answers:

For more elegance you can do this:

df.groupby('classes').apply(lambda x: x.sample(sample_size))
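As a possible alternative, and assuming pandas 1.1 or later, the groupby object has its own sample method, which avoids the lambda. The sketch below uses a made-up toy frame; the column values and the random_state are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Hypothetical imbalanced toy data with a 'classes' column.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})
sample_size = 10  # rows per class; must not exceed the smallest class size

# DataFrameGroupBy.sample draws n rows from each group without replacement.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

print(equally_sampled["classes"].value_counts())
```

If a class has fewer than sample_size rows, this raises an error unless replace=True is passed.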

Extension:

You can make sample_size a function of the group size, so that each class is sampled in proportion to how often it occurs:

nrows = len(df)
total_sample_size = 10_000
df.groupby('classes').apply(
    # len(x) is the number of rows in the group; x.count() would return a Series.
    lambda x: x.sample(int(len(x) / nrows * total_sample_size))
)

Because each group's sample size is truncated to an integer, the result may not contain exactly total_sample_size rows, but the sampling will be more proportional than with the naive method.
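The same proportional idea can be sketched with the frac argument of the groupby sample method (pandas 1.1 or later is assumed), which applies one sampling fraction to every group. The toy frame and the numbers here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: classes 1, 2, 3 with 50, 30 and 20 rows.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})
total_sample_size = 20

# One shared fraction keeps each class's share of the sample
# equal to its share of the full data set.
frac = total_sample_size / len(df)
proportional = df.groupby("classes").sample(frac=frac, random_state=0)

print(proportional["classes"].value_counts().sort_index())
```

Per-group sizes are still rounded to whole rows, so the total can differ slightly from total_sample_size when the class sizes do not divide evenly.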

Answered By: Kartik

While the accepted answer is awesome, here is another approach for when the dataset is highly imbalanced.

For example, suppose a dataset has 100K data points (rows), of which 16K have label 0 (-ve class) and the remaining 84K have label 1 (+ve class). To extract a sample of 50K data points that contains all 16K of the -ve class and fills the remaining space with the +ve class, we can do the following:

from sklearn import utils

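# Pick all of the -ve class, fill the rest of the sample with the +ve class, and shuffle.
# head(n) keeps the first n rows of each group, so the +ve rows are the first 34K, not a random draw.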
df = utils.shuffle(df.groupby("class_label").head(50000 - 16000))

# Reset index by dropping old index if not required.
df.reset_index(drop=True, inplace=True) # Optional step.
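If the majority-class rows should be drawn at random rather than taken from the top of the frame, a pandas-only variant might look like the sketch below. It is scaled down to 100 rows for illustration; the toy frame, target_size and the random_state values are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy frame: 16 rows of label 0 (-ve) and 84 rows of label 1 (+ve).
df = pd.DataFrame({"class_label": [0] * 16 + [1] * 84, "value": range(100)})
target_size = 50

minority = df[df["class_label"] == 0]
majority = df[df["class_label"] == 1]

# Keep every minority row and draw the remainder at random from the majority class.
filled = majority.sample(n=target_size - len(minority), random_state=0)

# sample(frac=1) shuffles the combined rows; reset_index drops the old index.
sample_df = (
    pd.concat([minority, filled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(sample_df["class_label"].value_counts())
```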


Answered By: Dheemanth Bhat