How can I implement incremental training for xgboost?

How can I implement incremental training for xgboost?


Try saving your model after you train on the first batch. Then, on successive runs, pass the filepath of the saved model to xgb.train via the xgb_model parameter.
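
A minimal sketch of that pattern (the params dict and the DMatrix/file names here are placeholders, not part of the experiment below):

import xgboost as xgb

# first run: train on the first batch and persist the booster
booster = xgb.train(params, dtrain_first_batch, num_boost_round=30)
booster.save_model('first_batch.model')

# later run: keep boosting on a new batch, starting from the saved model
booster = xgb.train(params, dtrain_next_batch, num_boost_round=30,
                    xgb_model='first_batch.model')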

Here's a small experiment that I ran to convince myself that it works:

First, split the boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py


There is now (since version 0.6?) a process_type parameter that might help. Here's an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target
X = pd.DataFrame(X, columns=features)
y = pd.Series(y, index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # this looks silly
    pass

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]

X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]

y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}

model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t', mse(model_0.predict(xg_test), y_test))  # benchmark
print('model 1 \t', mse(model_1.predict(xg_test), y_test))
print('model 2 \t', mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t', mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t', mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train       17.8364309709
model 1          24.2542132108
model 2          25.6967017352
model 1+2        22.8846455135
model 1+update2  14.2816257268
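
As a sanity check (reusing the boosters from the snippet above), you can count the trees in each booster with get_dump(): continuing with xgb_model should append new trees, while process_type='update' with the refresh updater should only rewrite the statistics of the trees that already exist.

print(len(model_1.get_dump()))            # trees built on the first half
print(len(model_2_v2.get_dump()))         # larger: continuation appends new trees
print(len(model_2_v2_update.get_dump()))  # same as model_1: 'update' only refreshes existing trees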


I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the boston dataset to train the model. I ran three experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.

The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one-shot (all-data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

import xgboost as xgb
import sklearn.metrics

# x_tr, y_tr, x_te, y_te are the train/test splits of the boston data (from the notebook)

batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]),
           xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))

XGBoost version: 0.6