How can I get a random sample from a DataFrame that preserves the distribution of a variable? (Python)

Question:

Context:
I have a large DataFrame that looks similar to this, but with 200k rows:

name      country   id
neymar    brazil    1234
ronaldo   portugal  5678
benzema   france    9012
t. silva  brazil    3456

I want to take a random sample of 100 rows from this DataFrame, but make sure the sample includes a few rows from each country. How could I do this? Thanks in advance!

df.sample(100, random_state = 20)

Asked By: codingrainha

Answers:

To preserve the distribution by country, you could use sklearn.utils.resample with stratify=df.country. Each country's share of the sample then roughly matches its share of the full DataFrame.

For example:

from sklearn.utils import resample

# Draw 100 rows without replacement, stratified by country
resample(df, n_samples=100, replace=False, stratify=df.country, random_state=123)

More details in https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
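
Note that proportional stratification can leave a very rare country with only one row, or possibly none, in a sample of 100. If "a few from each country" is a hard requirement, one option is a pandas-only sketch like the one below. It is not part of scikit-learn or pandas: the helper name sample_with_floor and its parameters are made up for illustration, the toy DataFrame stands in for the real 200k-row data, and the code assumes the DataFrame has a unique index.

```python
import pandas as pd


def sample_with_floor(df, by, n_total, min_per_group, random_state=None):
    """Return n_total rows with at least min_per_group rows per group.

    Hypothetical helper for illustration. Groups smaller than
    min_per_group contribute all of their rows. Assumes a unique index.
    """
    # Shuffle once so that "the first rows of each group" is a random pick.
    shuffled = df.sample(frac=1, random_state=random_state)

    # Guaranteed part: up to min_per_group rows from every group.
    floor = shuffled.groupby(by).head(min_per_group)
    remaining = n_total - len(floor)
    if remaining < 0:
        raise ValueError("n_total is too small for the requested minimum per group")

    # Fill the rest uniformly at random from the rows not yet chosen,
    # so large groups still dominate roughly in proportion to their size.
    rest = shuffled.drop(floor.index).sample(n=remaining, random_state=random_state)
    return pd.concat([floor, rest])


# Toy data standing in for the 200k-row DataFrame from the question.
df = pd.DataFrame({
    "name": [f"player_{i}" for i in range(905)],
    "country": ["brazil"] * 600 + ["portugal"] * 300 + ["france"] * 5,
    "id": range(905),
})

sample = sample_with_floor(df, by="country", n_total=100, min_per_group=3, random_state=20)
print(sample["country"].value_counts())
```

The trade-off is that the guaranteed minimum slightly over-represents rare countries compared with a purely proportional sample, so this fits best when coverage of every country matters more than exact proportions.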

Answered By: J. Ferrarons