How can I implement incremental training for xgboost?

How can I implement incremental training for xgboost?


Try saving your model after you train on the first batch. Then, on successive runs, pass the filepath of the saved model to xgb.train via the xgb_model parameter.
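
A minimal sketch of that pattern (the params dict and the DMatrix/file names here are placeholders, not part of the experiment below):

import xgboost as xgb

# first run: train on the first batch and persist the booster
booster = xgb.train(params, dtrain_first_batch, num_boost_round=30)
booster.save_model('first_batch.model')

# later run: keep boosting on a new batch, starting from the saved model
booster = xgb.train(params, dtrain_next_batch, num_boost_round=30,
                    xgb_model='first_batch.model')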

Here's a small experiment that I ran to convince myself that it works:

First, split the boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py


There is now (since version 0.6?) a process_type parameter that might help. Here's an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target
X = pd.DataFrame(X, columns=features)
y = pd.Series(y, index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # this looks silly
    pass

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]

X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]

y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}

model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t', mse(model_0.predict(xg_test), y_test))  # benchmark
print('model 1 \t', mse(model_1.predict(xg_test), y_test))
print('model 2 \t', mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t', mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t', mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train       17.8364309709
model 1          24.2542132108
model 2          25.6967017352
model 1+2        22.8846455135
model 1+update2  14.2816257268
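
As a sanity check (reusing the boosters from the snippet above), you can count the trees in each booster with get_dump(): continuing with xgb_model should append new trees, while process_type='update' with the refresh updater should only rewrite the statistics of the trees that already exist.

print(len(model_1.get_dump()))            # trees built on the first half
print(len(model_2_v2.get_dump()))         # larger: continuation appends new trees
print(len(model_2_v2_update.get_dump()))  # same as model_1: 'update' only refreshes existing trees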


I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the boston dataset to train the model. I ran three experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.

The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one-shot (all-data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

import xgboost as xgb
import sklearn.metrics

# x_tr, y_tr, x_te, y_te are the train/test splits of the boston data (from the notebook)

batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]),
           xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))

XGBoost version: 0.6