How can I get a random sample from a DataFrame that preserves the distribution of a variable? (Python)

Question

Context:
I have a large DataFrame that looks similar to this, but with 200k rows:

name	country	id
neymar	brazil	1234
ronaldo	portugal	5678
benzema	france	9012
t. silva	brazil	3456

I want to take a random sample of 100 rows from this DataFrame, but ensure I have a few from each country in the sample. How could I do this? Thanks in advance!

df.sample(100, random_state = 20)

Asked By: codingrainha


Answer 1

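To preserve the distribution by country, you could use sklearn.utils.resample with stratify=df.country. Stratifying keeps each country's share of the sample roughly equal to its share of the full DataFrame, so a very rare country may still end up with few or no rows in a small sample.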

For example:

from sklearn.utils import resample

# Draw 100 rows without replacement, keeping country proportions close to those in df
resample(df, n_samples=100, replace=False, stratify=df.country, random_state=123)

More details in https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
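
If the goal is specifically to guarantee a few rows from every country, rather than to mirror the country proportions, a pandas-only approach is one option. The sketch below is not part of the original answer: the helper name sample_with_minimum, its parameters, and the toy data are all assumptions made for illustration, and it assumes the DataFrame has a unique index. It shuffles the rows, keeps the first few per country, then fills the rest of the sample at random from the remaining rows.

```python
import pandas as pd


def sample_with_minimum(df, by, n, min_per_group, random_state=None):
    """Hypothetical helper: draw n rows with at least min_per_group rows
    from every group in `by` (or the whole group, if it is smaller)."""
    # Shuffle once so that "the first k rows of each group" is a random pick.
    shuffled = df.sample(frac=1, random_state=random_state)

    # Guaranteed part: up to min_per_group rows from each group.
    guaranteed = shuffled.groupby(by).head(min_per_group)
    if len(guaranteed) > n:
        raise ValueError("n is too small to give every group its minimum")

    # Fill the rest at random from rows not already chosen.
    # Relies on a unique index so that drop() removes exactly those rows.
    remaining = shuffled.drop(guaranteed.index)
    extra = remaining.sample(n - len(guaranteed), random_state=random_state)

    return pd.concat([guaranteed, extra])


# Toy data (made up): three countries with very uneven row counts.
countries = ["brazil"] * 700 + ["portugal"] * 250 + ["france"] * 50
df = pd.DataFrame({
    "name": [f"player_{i}" for i in range(len(countries))],
    "country": countries,
    "id": range(len(countries)),
})

sample = sample_with_minimum(df, by="country", n=100, min_per_group=3,
                             random_state=20)
print(sample["country"].value_counts())
```

Because the remaining rows are drawn uniformly, the final sample still leans toward the larger countries; only the floor per country is enforced.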

Answered By: J. Ferrarons
