stratified sampling with priors in python

Question:

Context

The common scenario of applying stratified sampling is about choosing a random sample that roughly maintains the distribution of the selected variable(s) so that it is representative.

Goal:

The goal is to create a function to perform stratified sampling, but with provided proportions for the considered variable instead of the original dataset proportions.

The Function:

def stratified_sampling_prior(df,column,prior_dict,sample_size):
   ...
   return df_sampled
  • df: the input dataset.
  • column: the categorical variable used to perform the stratified sampling.
  • prior_dict: the desired proportion (percentage) for each category of the selected variable.
  • sample_size: the number of instances we would like to have in the sample.

Example

Here I provide a working data sample:

import pandas as pd

prior_dict = {
  "A": 0.2,
  "B": 0.2,
  "C": 0.1,
  "D": 0.5
}


df = pd.DataFrame({"Category":["A"]*10+["B"]*50+["C"]*15+["D"]*100,
             "foo":["foo" for i in range(175)],
             "bar":["bar" for i in range(175)]})

With traditional stratified sampling and a defined sample_size, the sample would keep the original proportions of the dataset:

df["Category"].value_counts()/df.shape[0]*100
D    57.14
B    28.57
C     8.57
A     5.71
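
For reference, such a proportional stratified sample can be drawn with a sketch along these lines (using pandas' GroupBy.sample, available since pandas 1.1; random_state is only for reproducibility):

sample_size = 100
# proportional stratified sample: each category keeps roughly its original share
df_strat = df.groupby("Category", group_keys=False).sample(
    frac=sample_size / len(df), random_state=0
)
print(df_strat["Category"].value_counts(normalize=True) * 100)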

However, when using the prior_dict, the expected proportions of the output would be:

df_sample = stratified_sampling_prior(df,"Category",prior_dict,sample_size=100)
df_sample["Category"].value_counts()/df_sample.shape[0]*100
D    50.00
B    20.00
C    10.00
A    20.00
Asked By: PeCaDe


Answers:

From your question it is unclear whether you need this to be a probabilistic function. That is, should the expected proportions converge to the prior, or do you want the sample to conform to the prior exactly, no matter what?

If you want it to conform to the prior exactly, then I see two major issues:

  1. The randomness of the sampling could be severely hurt – imagine a situation where all of the A-category rows would have to be included.

  2. On the flip side, there are times when it will be virtually impossible to satisfy. If in your example there are 0 examples of A, there is no way to make it account for 20% of the sampled points (a feasibility-check sketch follows the example):

df = pd.DataFrame({"Category":["A"]*0+["B"]*50+["C"]*15+["D"]*100,
             "foo":["foo" for i in range(165)],
             "bar":["bar" for i in range(165)]})

Probabilistic Function

In this case you can use the prior to calculate a per-sample weight. We need the present proportion of each category, which we can obtain with:

df['Category'].value_counts(normalize=True)

D    0.571429
B    0.285714
C    0.085714
A    0.057143

Assuming we begin with weight 1 for each entry, we now know how to scale each point to obtain the new weight:

new_weight = desired_proportion / present_proportion

In the case of D, for instance, each example's new weight is new_weight = 1 * (0.5 / 0.571) = 0.875. We need to repeat this for each class.
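Doing the same for the other classes gives approximately A: 0.2 / 0.0571 ≈ 3.5, B: 0.2 / 0.2857 = 0.7 and C: 0.1 / 0.0857 ≈ 1.17.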

Here is a snippet that does that:

prior = {
      "A":0.2,
      "B":0.2,
      "C":0.1,
      "D":0.5
}
df['weight'] = 1
present_dist = df['Category'].value_counts(normalize=True)
for cat, p in present_dist.items():
    df.loc[df['Category'] == cat, 'weight'] = prior[cat] / (p + 1e-6)

sampledf = df.sample(weights = df['weight'])
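# note: without an n argument, DataFrame.sample returns a single row;
# the test below repeats this single weighted draw many times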

Testing

I ran some experiments and the results show that we do indeed converge to the desired prior. I ran 100,000 experiments and this is the distribution we got:

{'A': 19917, 'B': 19982, 'C': 9975, 'D': 50126}

That corresponds to:

A: 19.92%
B: 19.98%
C: 9.975%
D: 50.13%
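
A minimal sketch of how such a repeated-draw experiment can be reproduced, assuming the weight column computed above:

from collections import Counter

counts = Counter()
for _ in range(100_000):
    # each trial is a single weighted draw, as in the snippet above
    row = df.sample(weights=df['weight'])
    counts[row['Category'].iloc[0]] += 1
print(dict(counts))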

Edit: I inflated the df size and used a sample size of 10,000 to check whether each individual sample converges to the desired distribution:

# df composition (you can see it vastly differs from our desired prior)
A_l = 90000
B_l = 4500466
C_l = 5243287
D_l = 144144

tot = A_l + B_l + C_l + D_l

df = pd.DataFrame({"Category":["A"]*A_l+["B"]*B_l+["C"]*C_l+["D"]*D_l,
                 "foo":["foo" for _ in range(tot)],
                 "bar":["bar" for _ in range(tot)]})

Here are 10 tests sampling 10k rows:

{'A': 2007, 'B': 2038, 'C': 1029, 'D': 4926}
{'A': 1999, 'B': 1974, 'C': 1042, 'D': 4985}
{'A': 2018, 'B': 2024, 'C': 1011, 'D': 4947}
{'A': 1996, 'B': 2046, 'C': 979, 'D': 4979}
{'A': 2027, 'B': 2012, 'C': 1043, 'D': 4918}
{'A': 1991, 'B': 2031, 'C': 1027, 'D': 4951}
{'A': 1984, 'B': 1984, 'C': 1075, 'D': 4957}
{'A': 1972, 'B': 2014, 'C': 962, 'D': 5052}
{'A': 1975, 'B': 1998, 'C': 962, 'D': 5065}
{'A': 2016, 'B': 1966, 'C': 994, 'D': 5024}
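
Each of those lines corresponds to a single weighted draw of 10,000 rows, roughly like this (a sketch, assuming the prior dict from above and recomputing the weights for the inflated df):

present_dist = df['Category'].value_counts(normalize=True)
df['weight'] = df['Category'].map(lambda c: prior[c] / present_dist[c])
sample = df.sample(n=10_000, weights=df['weight'])
print(sample['Category'].value_counts().to_dict())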

You can see that, regardless of the changes in the distribution, we manage to enforce our prior.
If you still want the deterministic function then tell me; however, I strongly recommend not using it, as it will be mathematically incorrect and will cause you pain later on.

Answered By: Dr. Prof. Patrick

As a result of this thread, the following function was put together to accomplish the task; hope this helps the community. Further improvements are welcome.

import pandas as pd

def stratified_sampling_prior(df, stratify_variable, prior_dict, sample_size, epsilon=1e-6):
    """Draw a weighted sample so that stratify_variable follows the provided prior proportions.
    Input:
      - df: the input dataframe.
      - stratify_variable: name of the column in df used for the stratified weighted sampling with priors.
      - prior_dict: a dict mapping every category present in stratify_variable to its desired proportion.
      - sample_size: the number of rows in the output sample.
    Output:
      - a dataframe of sample_size rows whose stratify_variable proportions approximate prior_dict.
    """

    if not all(elem in prior_dict for elem in df[stratify_variable].unique()):
        raise Exception("Prior dict error: some categories present in the input df are missing from prior_dict.")

    # Compute the proportions currently present in df (the original, biased distribution).
    present_dist = df[stratify_variable].value_counts(normalize=True)

    # Weight each row by desired proportion / present proportion so the prior corrects the original distribution.
    for cat, p in present_dist.items():
        df.loc[df[stratify_variable] == cat, 'sample_weight'] = prior_dict[cat] / (p + epsilon)

    # Weighted sampling without replacement: the sample's category proportions follow the prior in expectation.
    output_df = df.sample(weights=df['sample_weight'], n=sample_size, replace=False)

    return output_df
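
A minimal usage sketch with the example data from the question (a single draw only approximates the prior, and without replacement a category cannot contribute more rows than it has in df):

df = pd.DataFrame({"Category": ["A"]*10 + ["B"]*50 + ["C"]*15 + ["D"]*100,
                   "foo": ["foo" for _ in range(175)],
                   "bar": ["bar" for _ in range(175)]})

prior_dict = {"A": 0.2, "B": 0.2, "C": 0.1, "D": 0.5}

df_sample = stratified_sampling_prior(df, "Category", prior_dict, sample_size=100)
print(df_sample["Category"].value_counts() / df_sample.shape[0] * 100)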
Answered By: PeCaDe