"OverflowError: Python int too large to convert to C long" when running a RandomizedSearchCV with scipy distributions

Question:

I want to run the following RandomizedSearch:

from scipy.stats import reciprocal, uniform

tree_reg = DecisionTreeRegressor()

param_grid = {
    "max_depth": np.arange(1, 12, 1),
    "min_samples_leaf": np.arange(2, 2259, 10),
    "min_samples_split": np.arange(2, 2259, 2),
    "max_leaf_nodes": np.arange(2, 2259, 2),
    "max_features": np.arange(2, len(features))
    }

rnd_search_tree = RandomizedSearchCV(tree_reg, param_grid, cv=cv_split, n_iter=10000,
                                     scoring=['neg_root_mean_squared_error', 'r2'],
                                     refit='neg_root_mean_squared_error',
                                     return_train_score=True, verbose=2)

rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)

Where 2259 is the number of data points I have. However, I get the following error:

OverflowError                             Traceback (most recent call last)
<ipython-input-809-76074980f31c> in <module>
     13                                     return_train_score=True, verbose=2)
     14 
---> 15 rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    734                 return results
    735 
--> 736             self._run_search(evaluate_candidates)
    737 
    738         # For multi-metric evaluation, store the best_index_, best_params_ and

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1529         evaluate_candidates(ParameterSampler(
   1530             self.param_distributions, self.n_iter,
-> 1531             random_state=self.random_state))

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
    698 
    699             def evaluate_candidates(candidate_params):
--> 700                 candidate_params = list(candidate_params)
    701                 n_candidates = len(candidate_params)
    702 

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in __iter__(self)
    283                 n_iter = grid_size
    284             for i in sample_without_replacement(grid_size, n_iter,
--> 285                                                 random_state=rng):
    286                 yield param_grid[i]
    287 

sklearn\utils\_random.pyx in sklearn.utils._random.sample_without_replacement()

OverflowError: Python int too large to convert to C long

I do not get the error if I remove even just one of the parameters to search over (or if I coarsen the step of the ranges to 1000, for example). Is there a way to solve this while passing all the values I'd like to try?

Asked By: giacrava


Answers:

I don't see an alternative to dropping RandomizedSearchCV. Internally, RandomizedSearchCV calls sample_without_replacement to draw candidates from your parameter grid. When the total grid size exceeds what a C long can hold, scikit-learn's sample_without_replacement simply breaks down.
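To see the scale of the problem, note that ParameterSampler first computes the size of the full grid as the product of the lengths of all the value lists. With the ranges from the question, that product already overflows a 32-bit C long (which is what `long` is on Windows, matching the traceback's paths) even before max_features is counted. A quick sanity check, with the range lengths copied from the question:

```python
import numpy as np

# Lengths of the parameter ranges from the question (max_features omitted,
# since len(features) is unknown -- the overflow happens without it already)
sizes = [
    len(np.arange(1, 12, 1)),     # max_depth: 11 values
    len(np.arange(2, 2259, 10)),  # min_samples_leaf: 226 values
    len(np.arange(2, 2259, 2)),   # min_samples_split: 1129 values
    len(np.arange(2, 2259, 2)),   # max_leaf_nodes: 1129 values
]

grid_size = 1
for n in sizes:
    grid_size *= n

print(grid_size)               # 3168757526
print(grid_size > 2**31 - 1)   # True -> too big for a 32-bit C long
```

Dropping one parameter or coarsening a step keeps the product under that limit, which is why the error disappears in those cases.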

Luckily, random search kind of sucks anyway. Check out optuna as an alternative. It is way smarter about where in your parameter space to spend time evaluating (paying more attention to high-performing regions), and it does not require you to discretize your parameter space beforehand (that is, you can omit the step size). More generally, check out the field of AutoML.

If you insist on random search, however, you'll have to find another implementation. Conveniently, optuna also supports a random sampler.

Answered By: orlp

You can sample n_iter combinations yourself beforehand and perform a GridSearchCV over that random subgrid:

import random

def sample_grid(full_grid, n_iter, random_state=None):
    """
    sklearn's ParameterSampler (which, e.g., RandomizedSearchCV uses)
    hits overflow error if grid is too large, so we roll our own by
    producing a list-of-dicts amenable to be used in a GridSearchCV.
    """
    random.seed(random_state)
    return [{param_name: [random.choice(param_possibilities)]
             for param_name, param_possibilities in full_grid.items()}
            for _ in range(n_iter)]


# Somewhere down the road...
param_grid = {...}

gs = GridSearchCV(model,
                  param_grid=sample_grid(param_grid, n_iter=1_000),
                  ...)
Answered By: Mustafa Aydın