Scikit-learn balanced subsampling

Question:

I’m trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and can be overlapping, as I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called spreadsubsample; is there an equivalent in sklearn?
http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting but that’s not what I’m looking for.)

Asked By: mikkom


Answers:

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same imbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code therein and change the imposed ratio to always be 50/50.

(This would probably be a very good contribution to scikit-learn if you feel up to it.)
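
For reference, a minimal sketch of how StratifiedShuffleSplit can be used (module path shown for newer sklearn versions; X and y are placeholder arrays). Note that each subsample keeps the original class ratio rather than forcing 50/50:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(1000, 5)                      # placeholder features
y = np.r_[np.zeros(900), np.ones(100)]           # placeholder imbalanced labels

# 10 random subsamples of 200 rows, each preserving the 90/10 class ratio
sss = StratifiedShuffleSplit(n_splits=10, train_size=200, random_state=0)
for subsample_idx, _ in sss.split(X, y):
    X_sub, y_sub = X[subsample_idx], y[subsample_idx]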

Answered By: eickenberg

Here is my first version, which seems to be working fine; feel free to copy it or suggest how it could be made more efficient (I have quite a lot of programming experience in general, but not that much with Python or NumPy).

This function creates a single random balanced subsample.

edit: The subsample currently downsamples the larger classes to the size of the smallest class; this should probably be made configurable.

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):

    # collect the rows of each class and find the smallest class size
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # draw use_elems rows from each class (shuffle first so the pick is random)
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
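
If it helps, a quick usage sketch of the function above (toy arrays, just to illustrate the call):

X = np.random.rand(100, 3)                # toy features
y = np.r_[np.zeros(80), np.ones(20)]      # imbalanced toy labels

X_bal, y_bal = balanced_subsample(X, y)   # returns 20 rows per class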

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes (a combined sketch follows the list):

  1. Replace the np.random.shuffle line with

    this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

  2. Replace the np.concatenate lines with

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys),name='target')
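
Putting those two changes together, a sketch of a DataFrame variant might look like this (balanced_subsample_df is just a placeholder name; x is a DataFrame of features and y a Series of labels sharing its index):

import numpy as np
import pandas as pd

def balanced_subsample_df(x, y, subsample_size=1.0):
    class_xs = []
    min_elems = None

    # collect the rows of each class and find the smallest class size
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            # shuffle rows by permuting the index instead of np.random.shuffle
            this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')

    return xs, ys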

Answered By: mikkom

My subsampler version; hope this helps.

import random

def subsample_indices(y, size):
    indices = {}
    target_values = set(y)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])
Answered By: hernan

Below is my Python implementation for creating a balanced copy of the data.
Assumptions:
1. the target variable (y) is a binary class (0 vs. 1)
2. 1 is the minority class.

from numpy import unique
from numpy import random 

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class 
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversampling on observations of positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
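
A hypothetical usage sketch (toy arrays, with 1 as the minority class per the assumptions above):

import numpy as np

X = np.random.rand(100, 4)                 # toy features
y = np.array([0] * 90 + [1] * 10)          # 1 is the minority class

X_bal, y_bal = balanced_sample_maker(X, y, random_seed=42)
# result: the 90 majority rows plus 90 minority rows drawn with replacement
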
Answered By: beingzy

A version for pandas Series:

import numpy as np

def balanced_subsample(y, size=None):

    subsample = []

    if size is None:
        n_smp = y.value_counts().min()
    else:
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
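
For example, a quick sketch of how it might be called (a small hypothetical Series):

import pandas as pd

y = pd.Series([0] * 8 + [1] * 2)   # imbalanced labels
idx = balanced_subsample(y)        # two index labels per class
y_balanced = y.loc[idx]
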
Answered By: gc5

Here is a version of the above code that works for multiclass groups (in my tested case, groups 0, 1, 2, 3, 4):

import numpy as np
def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling all classes with sample_size 
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx+=over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This also returns the indices, so they can be used for other datasets and to keep track of how frequently each sample was used (helpful for training).
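
For instance (hypothetical arrays, just to illustrate the call):

import numpy as np

X = np.random.rand(50, 3)
y = np.array([0] * 20 + [1] * 15 + [2] * 10 + [3] * 5)   # four imbalanced classes

X_bal, y_bal, idx = balanced_sample_maker(X, y, sample_size=10, random_seed=0)
# idx lists the selected row positions, so it can be reused on any array aligned with X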

Answered By: Kevin Mader

Although this has already been answered, I stumbled upon your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:

from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle = True)

batches = []
for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])

It’s important that you add the _: because skf.split() is designed to create stratified folds for K-fold cross-validation, it returns two lists of indices: train ((n - 1)/n of the elements) and test (1/n of the elements).

Please note that this is as of sklearn 0.18. In sklearn 0.17, the same function can be found in the cross_validation module instead.

Answered By: kadu

A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that dataframe that has two or more values.

For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    # sample each stratum to the smallest (uspl=True) or largest (uspl=False) stratum size, then shuffle
    n = min(lsz) if uspl else max(lsz)
    return pd.concat([f.sample(n=n, replace=(not uspl)).copy() for f in datas_l], axis=0).sample(frac=1)

This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
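
For example, on a toy DataFrame with a label column named 'label' (names are placeholders):

df = pd.DataFrame({'label': ['a'] * 8 + ['b'] * 2, 'x': range(10)})
df_down = balanced_spl_by(df, 'label', uspl=True)    # 2 rows per label, sampled without replacement
df_up = balanced_spl_by(df, 'label', uspl=False)     # 8 rows per label, sampled with replacement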

Answered By: Roko Mijic

A slight modification to the top answer by mikkom.

If you want to preserve the ordering of the larger class data, i.e. you don’t want to shuffle.

Instead of

    if len(this_xs) > use_elems:
        np.random.shuffle(this_xs)

do this

    if len(this_xs) > use_elems:
        ratio = len(this_xs) // use_elems   # integer division so the slice step is an int in Python 3
        this_xs = this_xs[::ratio]
Answered By: Bert Kellerman

There now exists a full-blown python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

Answered By: eickenberg

Simply select 100 rows from each class, with duplicates (i.e. sampling with replacement), using the following code. activity holds my classes (the labels of the dataset).

balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
Answered By: javac

Here is my solution, which can be tightly integrated in an existing sklearn pipeline:

from sklearn.model_selection import RepeatedKFold
import numpy as np


class DownsampledRepeatedKFold(RepeatedKFold):

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get index of major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get index of minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of major class of size minor class
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )

Use it as usual:

    for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
Answered By: spaenigs

I found the best solutions here

And this is the one I think is the simplest.

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

then you can use X_rus, y_rus data

For version 0.4 and later:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus= rus.fit_sample(X, y)

Then, the indices of the randomly selected samples can be accessed through the sample_indices_ attribute.
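
A brief sketch of the newer API, continuing from the X and y defined above (method and attribute names as of imbalanced-learn 0.4+; best checked against your installed version):

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)   # fit_resample is the newer name for fit_sample
id_rus = rus.sample_indices_            # indices of the rows that were kept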

Answered By: LinNotFound

Here’s a solution which is:

  • simple (< 10 lines code)
  • fast (besides one for loop, pure NumPy)
  • no external dependencies other than NumPy
  • is very cheap to generate new balanced random samples (just call np.random.choice()). Useful for generating different shuffled & balanced samples between training epochs

import numpy as np

def stratified_random_sample_weights(labels):
    num_samples, n_classes = labels.shape    # labels is a one-hot (n_samples, n_classes) array
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i]==1)  # find indices where class_i is 1
        class_indices = np.squeeze(class_indices)  # get rid of extra dim
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"
        
        sample_weights[class_indices] = 1.0/num_samples_class_i  # note: samples with no classes present will get weight=0

    return sample_weights / sample_weights.sum()  # sum(weights) == 1

Then you can re-use these weights over and over to generate balanced indices with np.random.choice():

sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

Full example:

# generate data
from sklearn.preprocessing import OneHotEncoder

num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1,3,num=n_classes,base=10,dtype=float)  # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum()  # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1)  # turn each element into a list
labels = OneHotEncoder(sparse=False).fit_transform(labels)


print(f"original counts: {labels.sum(0)}")
# [  38.   76.  127.  191.  282.  556.  865. 1475. 2357. 4033.]

sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107.  88. 107.  94. 118.  92.  99. 100.  91.]
Answered By: crypdick

Here are my 2 cents. Assume that we have the following unbalanced dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())

The first rows:

  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M

Assume now that we want to get a balanced dataset by Sentiment:

df_grouped_by = df.groupby(['Sentiment'])

df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Sentiment'])
print(df_balanced.head())

The first rows of the balanced dataset:

  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M

Let’s verify that it is balanced in terms of Sentiment

df_balanced.groupby(['Sentiment']).size()

We get:

Sentiment
0    369
1    369
dtype: int64

As we can see, we ended up with 369 negative (0) and 369 positive (1) Sentiment labels.

Answered By: George Pipis