How can I implement incremental training for xgboost?

Question:

The problem is that my train data could not be placed into RAM due to train data size. So I need a method which first builds one tree on whole train data set, calculate residuals build another tree and so on (like gradient boosted tree do). Obviously if I call model = xgb.train(param, batch_dtrain, 2) in some loop – it will not help, because in such case it just rebuilds whole model for each batch.

Asked By: Marat Zakirov

||

Answers:

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.

Here’s a small experiment that I ran to convince myself that it works:

First, split the boston dataset into training and testing sets.
Then split the training set into halves.
Fit a model with the first half and get a score that will serve as a benchmark.
Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn’t make a difference, then we would expect their scores to be similar..
But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

Answered By: Alain

There is now (version 0.6?) a process_update parameter that might help. Here’s an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx,test_idx in rs.split(X):  # this looks silly
    pass

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full traint',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 t',mse(model_1.predict(xg_test), y_test))  
print('model 2 t',mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2t',mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2t',mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train   17.8364309709
model 1      24.2542132108
model 2      25.6967017352
model 1+2    22.8846455135
model 1+update2  14.2816257268
Answered By: paulperry

I created a gist of jupyter notebook to demonstrate that xgboost model can be trained incrementally. I used boston dataset to train the model. I did 3 experiments – one shot learning, iterative one shot learning, iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.

The gist of the gist is that you’ll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'update':'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))

XGBoost version: 0.6

Answered By: Shubham Chaudhary

If your problem is regarding the dataset size and you do not really need Incremental Learning (you are not dealing with an Streaming app, for instance), then you should check out Spark or Flink.

This two frameworks can train on very large datasets with a small RAM, leveraging disk memory. Both framework deal with memory issues internally. While Flink had it solved first, Spark has caught up in recent releases.

Take a look at:

looks like you don’t need anything other than call your xgb.train(....) again but provide the model result from the previous batch:

# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1

this is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html
enter image description here

Answered By: Mobigital

To paulperry’s code, If change one line from “train_split = round(len(train_idx) / 2)” to “train_split = len(train_idx) – 50”. model 1+update2 will changed from 14.2816257268 to 45.60806270012028. And a lot of “leaf=0” result in dump file.

Updated model is not good when update sample set is relative small.
For binary:logistic, updated model is unusable when update sample set has only one class.

Answered By: Tao Cheng

One possible solution that I have not tested is to used a dask dataframe which should act the same as a pandas dataframe but (I assume) utilize disk and reads in and out of RAM. here are some helpful links.
this link mentions how to use it with xgboost also see
also see.
further there is an experimental options from XGBoost as well here but it is "not ready for production"

Answered By: Phillip Maire
Hey guys you can use my simple code for incremental model training with xgb base class :

    batch_size = 10000000


    X_train="your pandas training DataFrame" 
    y_train="Your lables"
    
    #Store eval results
    evals_result={}
    Deval = xgb.DMatrix(X_valid, y_valid)
    eval_sets = [(Dtrain, 'train'), (Deval, 'eval')]
    for start in range(0, n, batch_size):
           model = xgb.train({'refresh_leaf': True, 
                         'process_type': 'default', 
                         'max_depth': 5, 
                         'objective': 'reg:squarederror', 
                         'num_parallel_tree': 2,
                        'learning_rate':0.05,
                        'n_jobs':-1},
                        dtrain=xgb.DMatrix(X_train, y_train), evals=eval_sets, early_stopping_rounds=5,num_boost_round=100,evals_result=evals_result,xgb_model=model) 
Answered By: Pankaj Kalal

It’s not based on xgboost, but there is a C++ incremental decision tree.
see gaenari.

Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

Answered By: greenfish

I agree with @desertnaut in his solution.

I have a dataset where I split it into 4 batches. I have to do an initial fit without the xgb_model parameter first, then the next fits will have the xgb_model parameter, like in this (I’m using the Sklearn API):

for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}',end = ' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400, xgb_model=model_xgbc)
            
    preds = model_xgbc.predict(self.X_valid)
    
    rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)
Answered By: J R
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.