My target variable is multi-class and imbalanced, and I want to oversample it so that the classes are balanced.

Question:

Let x contain the features. Output of print(x):

    Restaurant  Cuisines    Average_Cost    Rating  Votes   Reviews Area
    0   3.526361    0.693147    5.303305    1.504077    2.564949    1.609438    7.214504
    1   1.386294    4.127134    4.615121    1.504077    2.484907    1.609438    5.905362
    2   2.772589    1.386294    5.017280    1.526056    4.605170    3.433987    6.131226
    3   3.912023    2.833213    5.525453    1.547563    5.176150    4.564348    7.643483
    4   3.526361    2.708050    5.303305    1.435085    5.948035    5.046646    6.126869
    ... ... ... ... ... ... ... ...
    11089   3.912023    0.693147    5.525453    1.648659    5.789960    5.046646    3.135494
    11090   1.386294    6.028279    4.615121    1.526056    3.610918    2.833213    7.643483
    11091   1.386294    2.397895    4.615121    1.504077    3.828641    2.944439    5.814131
    11092   1.386294    6.028279    4.615121    1.410987    3.218876    2.302585    5.905362
    11093   1.386294    6.028279    4.615121    1.029619    0.000000    0.000000    5.564520
    11094 rows × 7 columns

And let y be the multi-class target variable. Output of print(y.value_counts()):

    30 minutes     7406
    45 minutes     2665
    65 minutes      923
    120 minutes      62
    20 minutes       20
    80 minutes       14
    10 minutes        4
    Name: Delivery_Time, dtype: int64

The value counts show that the 30 minutes class has far more samples than any other class.

To balance these, I tried SMOTETomek to oversample the data. But I got an error:

    from imblearn.combine import SMOTETomek
    smk = SMOTETomek(ratio = 1)
    x_res, y_res = smk.fit_sample(x,y)

Error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-54-426e8b86623d> in <module>()
            1 from imblearn.combine import SMOTETomek
            2 smk = SMOTETomek(ratio = 1)
    ----> 3 x_res, y_res = smk.fit_sample(x,y)

    2 frames
    /usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
        311     if type_y != 'binary':
        312         raise ValueError(
    --> 313             '"sampling_strategy" can be a float only when the type '
        314             'of target is binary. For multi-class, use a dict.')
        315     target_stats = _count_class_sample(y)

    ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.
Asked By: Karndeep Singh


Answers:

You can see how imbalanced-learn validates sampling_strategy in its source:
https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_validation.py#L355

As the error says, a float is only accepted for a binary target; for a multi-class target you need to pass a dict. Alternatively, pass one of the string options, in which case SMOTE handles the multi-class setting internally.

Do:

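    from imblearn.over_sampling import SMOTE

    # "minority" resamples only the smallest class; "not majority" resamples all but the largest
    smote = SMOTE(sampling_strategy="minority")
    X, Y = smote.fit_resample(x_train, y_train)  # fit_sample in older releases

From the documentation: when sampling_strategy is a dict, the keys correspond to the targeted classes and the values correspond to the desired number of samples for each targeted class.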
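
The dict form described above could be sketched as follows. This is a minimal illustration, not the library's own recipe. It assumes a recent imbalanced-learn release (parameter sampling_strategy, method fit_resample). The helper names build_sampling_strategy and oversample_to_majority are made up for this example. Because SMOTE needs fewer neighbours than there are samples in the smallest class, and the question's smallest class has only 4 samples, the sketch also lowers k_neighbors from its default of 5.

```python
from collections import Counter


def build_sampling_strategy(y):
    """Map every class label to the size of the largest class (hypothetical helper)."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target for label in counts}


def oversample_to_majority(x, y):
    """Oversample every class up to the majority count with SMOTE (hypothetical helper)."""
    # Imported lazily so build_sampling_strategy works without imbalanced-learn.
    from imblearn.over_sampling import SMOTE

    smallest = min(Counter(y).values())
    # k_neighbors must be smaller than the smallest class (the default is 5).
    k_neighbors = min(5, smallest - 1)
    if k_neighbors < 1:
        raise ValueError("every class needs at least 2 samples for SMOTE")
    smote = SMOTE(
        sampling_strategy=build_sampling_strategy(y),
        k_neighbors=k_neighbors,
        random_state=0,
    )
    return smote.fit_resample(x, y)


# Three of the class counts from the question's y.value_counts() output.
labels = ["30 minutes"] * 7406 + ["45 minutes"] * 2665 + ["10 minutes"] * 4
strategy = build_sampling_strategy(labels)
print(strategy)
```

With the question's data, x_res, y_res = oversample_to_majority(x, y) should bring each of the seven classes up to 7406 samples. Lowering the dict values for the rarest classes is a reasonable alternative if that much synthetic data is not wanted.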
Answered By: MAC

I think you should keep the target classes in their original proportions. SMOTE may give you better results on the test set, but the model may then fail on new data coming from users (live data).
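
To put that caution in numbers, here is a small worked calculation using the class counts from the question. It only does arithmetic on those counts (no resampling is run) and assumes every class would be raised to the size of the largest one.

```python
# Class counts copied from the question's y.value_counts() output.
counts = {
    "30 minutes": 7406,
    "45 minutes": 2665,
    "65 minutes": 923,
    "120 minutes": 62,
    "20 minutes": 20,
    "80 minutes": 14,
    "10 minutes": 4,
}

majority = max(counts.values())

# Synthetic rows each class would need to reach the majority count.
synthetic = {label: majority - n for label, n in counts.items()}

total_original = sum(counts.values())
total_synthetic = sum(synthetic.values())
total_balanced = total_original + total_synthetic

for label, n in synthetic.items():
    print(f"{label:>12}: {counts[label]:>5} real + {n:>5} synthetic")

print(f"synthetic share: {total_synthetic / total_balanced:.1%}")
```

Fully balancing the 11094 original rows would add 40748 synthetic ones, so most of the resulting training set would be interpolated rather than observed. The "10 minutes" class alone would grow from 4 real samples to 7406.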

It's up to you whether to apply SMOTE or not. If you do, you can use this code:

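    from imblearn.over_sampling import SMOTE

    # Resample the training data only; "minority" targets just the smallest class
    smote = SMOTE(sampling_strategy="minority")
    X, Y = smote.fit_resample(x_train_data, y_train_data)  # fit_sample in older releases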
Answered By: Aniket Gaikwad