LightGBMError "Check failed: num_data > 0" with Sklearn RandomizedSearchCV

Question:

I’m tuning LGBMRegressor parameters with scikit-learn’s RandomizedSearchCV and got the error below.

error:

LightGBMError: b'Check failed: num_data > 0 at /src/LightGBM/src/io/dataset.cpp, line 27 .n'

I cannot tell why, or which specific parameters caused this error. Is any of the param_dist entries below unsuitable for train_x.shape: (1630, 1565)?

Please tell me any hints or solutions. Thank you.

LightGBM version: 2.0.12

The function that caused this error:

import scipy as sp
import scipy.stats  # makes sp.stats available
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

def get_lgbm(train_x, train_y, val_x, val_y):
    lgbm = lgb.LGBMRegressor(
                             objective='regression',
                             device='gpu',
                             n_jobs=1,
                            )
    param_dist = {'boosting_type': ['gbdt', 'dart', 'rf'],
                    'num_leaves': sp.stats.randint(2, 1001),
                    'subsample_for_bin': sp.stats.randint(10, 1001),
                    'min_split_gain': sp.stats.uniform(0, 5.0),
                    'min_child_weight': sp.stats.uniform(1e-6, 1e-2),
                    'reg_alpha': sp.stats.uniform(0, 1e-2),
                    'reg_lambda': sp.stats.uniform(0, 1e-2),
                    'tree_learner': ['data', 'feature', 'serial', 'voting' ],
                    'application': ['regression_l1', 'regression_l2', 'regression'],
                    'bagging_freq': sp.stats.randint(1, 11),
                    'bagging_fraction': sp.stats.uniform(1e-3, 0.99),
                    'feature_fraction': sp.stats.uniform(1e-3, 0.99),
                    'learning_rate': sp.stats.uniform(1e-6, 0.99),
                    'max_depth': sp.stats.randint(1, 501),
                    'n_estimators': sp.stats.randint(100, 20001),
                    'gpu_use_dp': [True, False],
                 }
    rscv = RandomizedSearchCV(
                              estimator=lgbm,
                              param_distributions=param_dist,
                              cv=3,
                              n_iter=3000,
                              n_jobs=4,
                              verbose=1,
                              refit=True,
                              fit_params={'eval_set':(val_x, val_y.ravel()),
                                        'early_stopping_rounds':1,
                                        'eval_metric':['l2', 'l1'],
                                        'verbose': False,
                                       },
                            )
    # This line throws error
    rscv = rscv.fit(train_x, 
                    train_y.ravel(), 
                    )
    return rscv.best_estimator_

The full stack trace is too long to post; here is the part from the LightGBM source.

...........................................................................
/opt/conda/lib/python3.6/site-packages/lightgbm/sklearn.py in fit(self=LGBMRegressor(application='regression_l1',
     ..., subsample_freq=1,
       tree_learner='voting'), X=memmap([[-0.80256822,  1.63302752, -0.55377441, ...12.251635  ,
         12.27866017,  1.        ]]), y=array([-1.81712472,  0.        , -1.7366136 ,  0...        ,  0.36258158, -0.13661202,  0.2919708 ]), sample_weight=None, init_score=None, eval_set=(memmap([[-1.16531701, -0.97454256, -1.36807818, ...11.55465037,
         11.55160629,  2.        ]]), array([ 0.58517555, -1.01419878, -0.05787037, -0...64139942,  1.04166667,  0.        , -0.11668611])), eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_metric=['l2', 'l1'], early_stopping_rounds=1, verbose=False, feature_name='auto', categorical_feature='auto', callbacks=None)
    613                                        eval_init_score=eval_init_score,
    614                                        eval_metric=eval_metric,
    615                                        early_stopping_rounds=early_stopping_rounds,
    616                                        verbose=verbose, feature_name=feature_name,
    617                                        categorical_feature=categorical_feature,
--> 618                                        callbacks=callbacks)
        callbacks = None
    619         return self
    620 
    621     base_doc = LGBMModel.fit.__doc__
    622     fit.__doc__ = (base_doc[:base_doc.find('eval_class_weight :')] +

...........................................................................
/opt/conda/lib/python3.6/site-packages/lightgbm/sklearn.py in fit(self=LGBMRegressor(application='regression_l1',
     ..., subsample_freq=1,
       tree_learner='voting'), X=array([[-0.80256822,  1.63302752, -0.55377441, .... 12.251635  ,
        12.27866017,  1.        ]]), y=array([-1.81712472,  0.        , -1.7366136 ,  0...        ,  0.36258158, -0.13661202,  0.2919708 ]), sample_weight=None, init_score=None, group=None, eval_set=[(memmap([[-1.16531701, -0.97454256, -1.36807818, ...11.55465037,
         11.55160629,  2.        ]]), array([ 0.58517555, -1.01419878, -0.05787037, -0...64139942,  1.04166667,  0.        , -0.11668611]))], eval_names=None, eval_sample_weight=None, eval_class_weight=None, eval_init_score=None, eval_group=None, eval_metric=['l2', 'l1'], early_stopping_rounds=1, verbose=False, feature_name='auto', categorical_feature='auto', callbacks=None)
    468                               self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
    469                               early_stopping_rounds=early_stopping_rounds,
    470                               evals_result=evals_result, fobj=self._fobj, feval=feval,
    471                               verbose_eval=verbose, feature_name=feature_name,
    472                               categorical_feature=categorical_feature,
--> 473                               callbacks=callbacks)
        callbacks = None
    474 
    475         if evals_result:
    476             self._evals_result = evals_result
    477 

...........................................................................
/opt/conda/lib/python3.6/site-packages/lightgbm/engine.py in train(params={'application': 'regression_l1', 'bagging_fraction': 0.0013516565394267757, 'bagging_freq': 8, 'boosting_type': 'dart', 'colsample_bytree': 1.0, 'device': 'gpu', 'feature_fraction': 0.18574060093496944, 'gpu_use_dp': True, 'learning_rate': 0.06354739024799887, 'max_depth': 267, ...}, train_set=<lightgbm.basic.Dataset object>, num_boost_round=11610, valid_sets=[<lightgbm.basic.Dataset object>], valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=1, evals_result={}, verbose_eval=False, learning_rates=None, keep_training_booster=False, callbacks={<function print_evaluation.<locals>.callback>, <function early_stopping.<locals>.callback>, <function record_evaluation.<locals>.callback>})
    175     callbacks_before_iter = sorted(callbacks_before_iter, key=attrgetter('order'))
    176     callbacks_after_iter = sorted(callbacks_after_iter, key=attrgetter('order'))
    177 
    178     # construct booster
    179     try:
--> 180         booster = Booster(params=params, train_set=train_set)
        booster = undefined
        params = {'application': 'regression_l1', 'bagging_fraction': 0.0013516565394267757, 'bagging_freq': 8, 'boosting_type': 'dart', 'colsample_bytree': 1.0, 'device': 'gpu', 'feature_fraction': 0.18574060093496944, 'gpu_use_dp': True, 'learning_rate': 0.06354739024799887, 'max_depth': 267, ...}
        train_set = <lightgbm.basic.Dataset object>
    181         if is_valid_contain_train:
    182             booster.set_train_data_name(train_data_name)
    183         for valid_set, name_valid_set in zip(reduced_valid_sets, name_valid_sets):
    184             booster.add_valid(valid_set, name_valid_set)

...........................................................................
/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in __init__(self=<lightgbm.basic.Booster object>, params={'application': 'regression_l1', 'bagging_fraction': 0.0013516565394267757, 'bagging_freq': 8, 'boosting_type': 'dart', 'colsample_bytree': 1.0, 'device': 'gpu', 'feature_fraction': 0.18574060093496944, 'gpu_use_dp': True, 'learning_rate': 0.06354739024799887, 'max_depth': 267, ...}, train_set=<lightgbm.basic.Dataset object>, model_file=None, silent=False)
   1290             # construct booster object
   1291             self.handle = ctypes.c_void_p()
   1292             _safe_call(_LIB.LGBM_BoosterCreate(
   1293                 train_set.construct().handle,
   1294                 c_str(params_str),
-> 1295                 ctypes.byref(self.handle)))
        self.handle = c_void_p(None)
   1296             # save reference to data
   1297             self.train_set = train_set
   1298             self.valid_sets = []
   1299             self.name_valid_sets = []

...........................................................................
/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret=-1)
     43     ----------
     44     ret : int
     45         return value from API calls
     46     """
     47     if ret != 0:
---> 48         raise LightGBMError(_LIB.LGBM_GetLastError())
     49 
     50 
     51 def is_numeric(obj):
     52     """Check is a number or not, include numpy number etc."""

LightGBMError: b'Check failed: num_data > 0 at /usr/local/src/lightgbm/LightGBM/src/io/dataset.cpp, line 27 .n'
Asked By: Mockun_JPN


Answers:

The minimum values of bagging_fraction and feature_fraction could be too small. I changed both distributions to sp.stats.uniform(loc=0.1, scale=0.9) and it works.
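For reference, a minimal sketch of the adjusted param_dist entries (assuming the rest of the question’s param_dist is unchanged). With fractions as small as ~0.001, the bagged rows or sampled columns can round down to nothing, which would be consistent with the num_data > 0 check failing.

# keep the sampled fractions in [0.1, 1.0] instead of starting near 0
param_dist.update({
    'bagging_fraction': sp.stats.uniform(loc=0.1, scale=0.9),
    'feature_fraction': sp.stats.uniform(loc=0.1, scale=0.9),
})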

Answered By: Mockun_JPN

Maybe train_x or train_y is empty. You can check it by printing the data.
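For example, something like this (using the variable names from the question’s get_lgbm) would show whether the arrays are empty or contain NaNs:

import numpy as np

# shapes should have a non-zero first dimension; also look for NaNs
print(train_x.shape, train_y.shape)
print(np.isnan(train_x).any(), np.isnan(train_y).any())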

Answered By: rosefun

I got the same error in the LightGBM Python package. In my case, the test dataset had 0 rows, so make sure neither the train nor the test dataset is empty.
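A small sketch of such a guard, again assuming the variable names from the question:

# fail fast if any split ended up with 0 rows
for name, arr in [('train_x', train_x), ('train_y', train_y),
                  ('val_x', val_x), ('val_y', val_y)]:
    assert len(arr) > 0, '%s has 0 rows' % name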

Answered By: Kamal Chhirang

In my case, this error always happened when min_sum_hessian_in_leaf = 0, since I was grid-searching min_sum_hessian_in_leaf over [0, 2, 4, 5, 6, 7, 8, 9, 10].

After removing the 0 from the list, the error never happened again.
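A minimal sketch of the fixed search space (assuming a GridSearchCV over LGBMRegressor; the values are the ones from this answer with the 0 removed):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {'min_sum_hessian_in_leaf': [2, 4, 5, 6, 7, 8, 9, 10]}  # 0 removed
gscv = GridSearchCV(lgb.LGBMRegressor(objective='regression'), param_grid, cv=3)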

Answered By: Jim Chen