How to sample based on long-tail distribution from a pandas dataframe?

Question:

I have a pandas dataframe whose column a holds 1000 distinct values, with the value counts shown below. I would like to sample from this dataset so that the value counts of the sample still follow a long-tailed distribution. For example, to maintain the long-tailed distribution, sample4 may only end up with a value count of 400.

                           a
 sample1                  750
 sample2                  746
 sample3                  699
 sample4                  652
 sample5                  622
                          ... 
 sample996                  4
 sample997                  3
 sample998                  2
 sample999                  2
 sample1000                 1

I tried using this code:

import numpy as np

# Calculate the frequency of each element in column 'a'
freq = df['a'].value_counts()

# Calculate the probability of selecting each element based on its frequency
prob = freq / freq.sum()

# Sample from the df dataframe without replacement
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())

However, I get this error: ValueError: Weights and axis to be sampled must be of same length.

Asked By: HumanTorch

||

Answers:

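Your column contains duplicated values. df.sample expects one weight per row, but value_counts returns one entry per distinct value, so the weights list is shorter than the dataframe. You need a probability for every row, which you can get with groupby and transform('count') instead of value_counts: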

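# Per-row frequency: each row gets the count of its own value in column 'a'
freq = df.groupby('a')['a'].transform('count')
# Scale counts to relative frequencies (df.sample normalizes weights to sum to 1)
prob = freq / len(df)
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())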
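
Here is a minimal, self-contained sketch of the same idea on made-up data. The column name a comes from the question; the five labels, their counts, the sample size of 40, and random_state=0 are assumptions chosen only for illustration. Note that drawing n=len(df) rows without replacement returns every row in shuffled order, so the value counts cannot change; the sketch therefore draws fewer rows than the dataframe holds.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows whose labels in column 'a' have skewed frequencies
labels = ["sample1", "sample2", "sample3", "sample4", "sample5"]
counts = [50, 30, 12, 6, 2]
df = pd.DataFrame({"a": np.repeat(labels, counts)})

# value_counts gives one entry per distinct value (5 here), not one per row (100),
# which is why passing it as weights raises the length-mismatch ValueError
n_distinct = len(df["a"].value_counts())

# transform('count') broadcasts each group's size back to every row of that group
row_weights = df.groupby("a")["a"].transform("count")

# Draw fewer rows than the dataframe holds; frequent values are favored
df_sampled = df.sample(n=40, replace=False, weights=row_weights, random_state=0)

print(df_sampled["a"].value_counts())
```

Because the weights are a Series aligned to df's index, they can be passed to df.sample directly without converting to a list.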
Answered By: I'mahdi