subscript out of bounds in gbm function

r gbm

Just a hunch since I can't see your data, but I believe that error occurs when the test set contains factor levels that don't exist in the training set.

This can easily happen when a factor variable has a large number of levels, or when one level has only a few instances.

Since you're using CV folds, it's possible that the holdout set in one of the folds contains levels that never appear in that fold's training data.

I'd suggest either:

A) use model.matrix() to one-hot encode your factor variables before fitting

B) keep setting different seeds until you get a CV split where this error doesn't occur.
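A rough sketch of option A, using only base R. The data frame and its values are made up for illustration, and the column names (y, x1, x2) are assumptions, not your actual data. The idea is to expand the factor into 0/1 indicator columns on the full data set before any split, so every fold sees the same numeric columns:

```r
# Toy data (hypothetical values): x2 is a factor with several sparse levels.
df <- data.frame(
  y  = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 0),
  x1 = c(-0.25, -1.22, 1.56, 0.43, -1.20, 1.05, -1.31, -0.69, 0.60, -0.20),
  x2 = factor(c(2, 6, 1, 5, 5, 8, 6, 4, 3, 7))
)

# Expand x2 into 0/1 indicator columns on the FULL data set, before splitting.
# "- 1" drops the intercept so that every level of x2 gets its own column.
x2_dummies <- model.matrix(~ x2 - 1, data = df)

# Replace the factor with its indicator columns.
encoded <- data.frame(df[, c("y", "x1")], x2_dummies)

# Any split now has the same purely numeric columns on both sides.
train_rows <- c(2, 4, 5, 6, 7, 8, 9, 10)
train <- encoded[train_rows, ]
test  <- encoded[-train_rows, ]

identical(names(train), names(test))
```

With this encoding, a level that is missing from a training fold just shows up as an all-zero column there, rather than as an unknown factor level at prediction time. A tree-based model can simply ignore such a column; a linear model would report an NA coefficient for it.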

EDIT: yep, with that traceback, your 3rd CV holdout has a factor level in its test set that doesn't exist in the training set, so the predict function sees a foreign value and doesn't know what to do.

EDIT 2: Here's a quick example to show what I mean by "factor levels in the test set that don't exist in the training set":

    # Example data with low occurrences of a factor level:
    set.seed(222)
    data = data.frame(cbind(y = sample(0:1, 10, replace = TRUE), x1 = rnorm(10), x2 = as.factor(sample(0:10, 10, replace = TRUE))))
    data$x2 = as.factor(data$x2)
    data

          y         x1 x2
     [1,] 1 -0.2468959  2
     [2,] 0 -1.2155609  6
     [3,] 0  1.5614051  1
     [4,] 0  0.4273102  5
     [5,] 1 -1.2010235  5
     [6,] 1  1.0524585  8
     [7,] 0 -1.3050636  6
     [8,] 0 -0.6926076  4
     [9,] 1  0.6026489  3
    [10,] 0 -0.1977531  7

    # CV fold. This splits a model to be trained on 80% of the data, then tests
    # against the remaining 20%. This is a simpler version of what happens when
    # you call gbm's CV fold.
    CV_train_rows = sample(1:10, 8, replace = FALSE)
    CV_test_rows = setdiff(1:10, CV_train_rows)
    CV_train = data[CV_train_rows,]
    CV_test = data[CV_test_rows,]

    # build a model on the training...
    CV_model = lm(y ~ ., data = CV_train)
    summary(CV_model)
    # note here: as the model has been built, it was only fed factor levels
    # (3, 4, 5, 6, 7, 8) for variable x2

    CV_test$x2
    # in the test set, there are only levels 1 and 2.

    # attempt to predict on the test set
    predict(CV_model, CV_test)

    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
      factor x2 has new levels 1, 2
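If you'd rather check a split up front than wait for predict() to fail, something like the following should work. It's a minimal sketch: unseen_levels is a hypothetical helper (not part of gbm or base R), and the toy x2 values are copied from the example output above, with rows 1 and 3 held out.

```r
# Hypothetical helper: for each factor column, list the values that occur in
# `test` but never in `train`. Columns with nothing unseen are dropped.
unseen_levels <- function(train, test) {
  is_fac <- vapply(train, is.factor, logical(1))
  factor_cols <- names(train)[is_fac]
  unseen <- lapply(factor_cols, function(col) {
    setdiff(as.character(test[[col]]), as.character(train[[col]]))
  })
  names(unseen) <- factor_cols
  unseen[lengths(unseen) > 0]
}

# Toy split: rows 1 and 3 are held out, the other 8 rows are used for training.
toy <- data.frame(
  y  = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 0),
  x2 = factor(c(2, 6, 1, 5, 5, 8, 6, 4, 3, 7))
)
train <- toy[-c(1, 3), ]
test  <- toy[c(1, 3), ]

# Lists the x2 values that appear only in the holdout rows.
unseen_levels(train, test)
```

For this split the helper should flag levels 1 and 2 of x2, the same ones the error message complains about. An empty list means the split is safe in this respect.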

r gbm

I encountered the same problem and ended up solving it by modifying predict.gbm, one of the hidden functions in the gbm package. During cross-validation, this function takes the gbm object trained on the training portion of each split and uses it to predict the held-out test portion.

The problem is that the test set passed to it should contain only the columns corresponding to the features, so you need to modify the function accordingly.
