How can I sample equally from a dataframe?
Question:
Suppose I have some observations, each with an indicated class from 1
to n
. Each of these classes may not necessarily occur equally in the data set.
How can I equally sample from the dataframe? Right now I do something like…
frames = []
classes = df.classes.unique()
for i in classes:
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)
equally_sampled = pd.concat(frames)
Is there a pandas function to equally sample?
Answers:
More elegantly, you can do this:
df.groupby('classes').apply(lambda x: x.sample(sample_size))
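Since pandas 1.1, grouped frames also expose a `.sample()` method directly, which avoids the lambda and the extra group-key index level. A minimal sketch (the `classes` and `value` column names are just illustrative):

```python
import pandas as pd

# Toy frame: three classes with unequal counts.
df = pd.DataFrame({
    "classes": ["a"] * 5 + ["b"] * 10 + ["c"] * 20,
    "value": range(35),
})

sample_size = 3

# GroupBy.sample draws the same number of rows from every class.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)
print(equally_sampled["classes"].value_counts())
```

Note that `n` must not exceed the size of the smallest group, or pandas raises a `ValueError` (unless you pass `replace=True`).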
Extension:
You can make the sample_size a function of group size to sample proportionally instead of equally:
nrows = len(df)
total_sample_size = 1e4
df.groupby('classes', group_keys=False).apply(
    lambda x: x.sample(int(len(x) / nrows * total_sample_size)))
It won't yield exactly total_sample_size rows, because int() truncates each group's share, but the sampling will be proportional to group sizes rather than equal per class.
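The proportional variant above can be sketched end to end with small numbers (column names are illustrative; `group_keys=False` keeps the result flat):

```python
import pandas as pd

# Toy frame: class "a" holds 30% of the rows, class "b" 70%.
df = pd.DataFrame({
    "classes": ["a"] * 30 + ["b"] * 70,
    "value": range(100),
})

nrows = len(df)
total_sample_size = 20

# Each group contributes in proportion to its share of the rows;
# int() truncation means the total can fall slightly short.
proportional = df.groupby("classes", group_keys=False).apply(
    lambda x: x.sample(int(len(x) / nrows * total_sample_size))
)
print(proportional["classes"].value_counts())
```

With these counts the shares divide evenly (6 from "a", 14 from "b"); with uneven shares the truncation loses up to one row per group.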
While the accepted answer is awesome, another approach when the dataset is highly imbalanced:
For example: a dataset has 100K rows, of which 16K are label 0
(-ve class) and the remaining 84K are label 1
(+ve class). To extract a sample of 50K rows that keeps all 16K -ve class rows and fills the remaining space with +ve class rows, we can do the following:
from sklearn import utils
# Pick all -ve class, fill the sample with +ve class and shuffle.
df = utils.shuffle(df.groupby("class_label").head(50000 - 16000))
# Reset index by dropping old index if not required.
df.reset_index(drop=True, inplace=True) # Optional step.
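The trick is that `head(50000 - 16000)` takes at most 34K rows from each group, so the 16K-row minority group survives whole while the majority group is capped. A scaled-down sketch (16 negatives, 84 positives, target 50 rows; the `class_label` column mirrors the snippet above):

```python
import pandas as pd
from sklearn import utils

# Toy imbalanced frame: 16 negatives, 84 positives.
df = pd.DataFrame({
    "class_label": [0] * 16 + [1] * 84,
    "value": range(100),
})

# head(50 - 16) keeps at most 34 rows per group, so all 16 negatives
# survive and 34 positives fill the rest of the 50-row sample.
sample = utils.shuffle(df.groupby("class_label").head(50 - 16), random_state=0)
sample = sample.reset_index(drop=True)
print(sample["class_label"].value_counts())
```

If you'd rather avoid the sklearn dependency, `df.sample(frac=1)` shuffles a frame with pandas alone.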