How can I get a random sample from a DataFrame that preserves the distribution of a variable? (Python)
Question:
context:
i have a large dataframe that looks similar to this but has 200k rows
name | country | id |
---|---|---|
neymar | brazil | 1234 |
ronaldo | portugal | 5678 |
benzema | france | 9012 |
t. silva | brazil | 3456 |
i want to take a random sample of 100 rows from this dataframe but ensure i have a few rows from each country in the sample – how could i do this? thanks in advance!! this is what i have so far:
df.sample(100, random_state = 20)
Answers:
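To preserve the distribution by country, you could use sklearn.utils.resample with stratify=df.country. Stratifying keeps each country's share of rows roughly the same as in the full dataframe, so a very rare country may still end up with no rows in a small sample.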
For example:
from sklearn.utils import resample
# n_samples=100 matches the sample size asked for in the question
sample = resample(df, n_samples=100, replace=False, stratify=df.country, random_state=123)
More details in https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
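If the goal is specifically "at least a few rows from every country" rather than matching the overall proportions, a pandas-only approach might also work. The following is a minimal sketch, not a definitive implementation: `sample_with_minimum` is a hypothetical helper name, the toy data is made up to stand in for the real 200k-row frame, and the code assumes the dataframe index is unique.

```python
import pandas as pd


def sample_with_minimum(df, by, n_total, min_per_group, seed=0):
    """Hypothetical helper: n_total random rows, at least min_per_group per group."""
    # Shuffle once, then keep the first min_per_group rows of every group.
    # Groups smaller than min_per_group contribute all of their rows.
    shuffled = df.sample(frac=1, random_state=seed)
    guaranteed = shuffled.groupby(by).head(min_per_group)

    n_rest = n_total - len(guaranteed)
    if n_rest < 0:
        raise ValueError("n_total is too small for the requested minimum per group")

    # Fill the remainder with a plain random draw from the rows not yet chosen.
    # Assumes the index is unique, so drop() removes exactly the chosen rows.
    rest = df.drop(guaranteed.index).sample(n=n_rest, random_state=seed)
    return pd.concat([guaranteed, rest])


# Toy data standing in for the 200k-row frame (values are made up).
toy = pd.DataFrame({
    "name": [f"player{i}" for i in range(1000)],
    "country": ["brazil"] * 900 + ["portugal"] * 95 + ["france"] * 5,
    "id": range(1000),
})
sample = sample_with_minimum(toy, by="country", n_total=100, min_per_group=3, seed=20)
print(sample["country"].value_counts())
```

The trade-off is that small countries are over-represented compared with a strictly proportional sample, so this fits when coverage of every country matters more than exact proportions.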