Pandas representative sampling across multiple columns

Question

I have a dataframe which represents a population, with each column denoting a different quality/characteristic of a person. How can I get a sample of that dataframe/population which is representative of the population as a whole across all characteristics?

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),
              'favourite_colour':c(colours, 650),
              'favourite_knight':c(knights, 650),
              'favourite_quality':c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

My question is this:
How can I generate a sample dataframe (df_sample) such that the above, i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

is true for all columns (except 'name_id') instead of just one of them? A population of 650 with a sample size of 140 is roughly what I'm working with, so performance isn't much of an issue. I'll happily accept solutions that take a couple of minutes to run, as this will still be considerably faster than producing the sample manually. Thank you for any help.

Asked By: Linden

Answer 1

Create a combined feature column, compute the weight of each combination, and sample using those weights:

df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

combined_weight = df['combined'].value_counts(normalize=True)

df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

df_sample = df.sample(140, weights=df['combined_weight'])

This needs an additional step: divide each weight by the count of its combination so that the weights sum to 1 (see Ehsan Fathi's answer).
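One way to check whether a given sample meets the "approximately equal" criterion from the question is to compare the proportions column by column. The sketch below is only an illustration: the helper name max_proportion_gap, the seeded generator, and the plain unweighted df.sample call are assumptions, not part of the original answers.

```python
import numpy as np
import pandas as pd

# Rebuild a dataframe shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
n = 650

df = pd.DataFrame({
    'name_id': rng.choice(3000, n, replace=False),
    'favourite_colour': rng.choice(colours, n),
    'favourite_knight': rng.choice(knights, n),
    'favourite_quality': rng.choice(qualities, n),
})
cols = ['favourite_colour', 'favourite_knight', 'favourite_quality']


def max_proportion_gap(population, sample, columns):
    """Hypothetical helper: for each column, return the largest absolute
    difference between population and sample proportions."""
    gaps = {}
    for col in columns:
        pop = population[col].value_counts(normalize=True)
        # Align the sample to the population's categories; a category that is
        # missing from the sample counts as proportion 0.
        smp = (sample[col].value_counts(normalize=True)
               .reindex(pop.index, fill_value=0.0))
        gaps[col] = float((pop - smp).abs().max())
    return gaps


# Any sample can be checked this way; here an unweighted draw is used.
df_sample = df.sample(140, random_state=0)
gaps = max_proportion_gap(df, df_sample, cols)
print(gaps)
```

Smaller gaps mean the sample tracks the population more closely for that column; how small is "close enough" depends on the use case.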

Answered By: Patrick Artner

Answer 2

I think this will do what you need:

df["combined"] = list(zip(df["favourite_colour"],
                      df["favourite_knight"],
                      df["favourite_quality"]))
weight = df['combined'].value_counts(normalize=True)
counts = df['combined'].value_counts()
df['combined_weight'] = df['combined'].apply(lambda x: weight[x]/counts[x])
df_sample = df.sample(140, weights=df['combined_weight'])

Note that normalize=True divides the count of each value by the total number of records. If you use that frequency directly as the weight of every row, the weights column won't sum to 1: each row of a value carries the value's full frequency, so a value with n rows ends up with a total weight proportional to n squared. Pandas will then normalize the weights again, which results in the wrong distribution.
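It may help to look at what the corrected weights actually are. The sketch below computes the same per-row quantities with groupby(...).transform('size') instead of looking up tuples (an assumption made here to keep the example simple; the seeded generator is also an assumption). Dividing the frequency count/N by the count leaves 1/N for every row, so the weights are uniform.

```python
import numpy as np
import pandas as pd

# Rebuild a dataframe shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
n = 650

df = pd.DataFrame({
    'name_id': rng.choice(3000, n, replace=False),
    'favourite_colour': rng.choice(colours, n),
    'favourite_knight': rng.choice(knights, n),
    'favourite_quality': rng.choice(qualities, n),
})
cols = ['favourite_colour', 'favourite_knight', 'favourite_quality']

# Per-row count of the row's (colour, knight, quality) combination.
counts = df.groupby(cols)['name_id'].transform('size')

# Per-row frequency of that combination, i.e. value_counts(normalize=True).
weight = counts / len(df)

# Frequency divided by count: (count / N) / count = 1 / N for every row.
df['combined_weight'] = weight / counts

print(df['combined_weight'].nunique())   # a single distinct weight, up to rounding
print(np.isclose(df['combined_weight'].sum(), 1.0))

df_sample = df.sample(140, weights=df['combined_weight'], random_state=0)
```

Because every row gets the same weight, this draw follows the same distribution as a plain df.sample(140) with no weights argument, which is already representative of every column on average.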

Answered By: Ehsan Fathi
