Time-series - data splitting and model evaluation


Note that the code in your original question already takes care of the time slicing, so you don't have to create the time slices by hand.

However, here is how to use createTimeSlices to split the data and then use the resulting slices to train and test a model.

Step 0: Setting up the data and trainControl (from your question):

library(caret)
library(ggplot2)
library(pls)
data(economics)

Step 1: Creating the timeSlices for the index of the data:

timeSlices <- createTimeSlices(1:nrow(economics),
                               initialWindow = 36, horizon = 12, fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

> str(timeSlices, max.level = 1)
## List of 2
##  $ train:List of 431
##   .. [list output truncated]
##  $ test :List of 431
##   .. [list output truncated]
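To see what the slices actually contain, here is a small sketch on a toy index of 10 points. The toy sizes (window of 4, horizon of 2) are my own choice for illustration and are not from the question. With these settings the number of slices is n - initialWindow - horizon + 1, which is also where the 431 above comes from (478 - 36 - 12 + 1).

```r
library(caret)

# Toy index of 10 time points (sizes chosen for illustration only)
toy <- createTimeSlices(1:10, initialWindow = 4, horizon = 2, fixedWindow = TRUE)

# Each training slice holds 4 consecutive indices;
# each test slice holds the 2 indices that immediately follow it
toy$train[[1]]  # 1 2 3 4
toy$test[[1]]   # 5 6

# Number of slices: n - initialWindow - horizon + 1
n_slices <- 10 - 4 - 2 + 1
length(toy$train) == n_slices
```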

For ease of understanding, I am saving them in separate variables:

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics[trainSlices[[1]],],
                    method = "pls",
                    preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

pred <- predict(plsFitTime,economics[testSlices[[1]],])

Step 4: Plotting:

true <- economics$unemploy[testSlices[[1]]]
plot(true, col = "red", ylab = "true (red) , pred (blue)", ylim = range(c(pred,true)))
points(pred, col = "blue")

You can then do this for all the slices:

for(i in seq_along(trainSlices)){
  plsFitTime <- train(unemploy ~ pce + pop + psavert,
                      data = economics[trainSlices[[i]],],
                      method = "pls",
                      preProc = c("center", "scale"))
  pred <- predict(plsFitTime,economics[testSlices[[i]],])

  true <- economics$unemploy[testSlices[[i]]]
  plot(true, col = "red", ylab = "true (red) , pred (blue)",
       main = i, ylim = range(c(pred,true)))
  points(pred, col = "blue")
}
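If you want a number per slice rather than a plot, the same loop can collect an error metric. The following is only a sketch: it uses caret's RMSE() helper, and it evaluates every 50th slice purely to keep the run short (that step size and the names idx and rmse are my own choices, not part of the question).

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)
data(economics)

timeSlices <- createTimeSlices(1:nrow(economics),
                               initialWindow = 36, horizon = 12, fixedWindow = TRUE)
trainSlices <- timeSlices[[1]]
testSlices  <- timeSlices[[2]]

# Evaluate only every 50th slice to keep the run short (arbitrary choice)
idx  <- seq(1, length(trainSlices), by = 50)
rmse <- numeric(length(idx))

for (k in seq_along(idx)) {
  i <- idx[k]
  fit <- train(unemploy ~ pce + pop + psavert,
               data = economics[trainSlices[[i]], ],
               method = "pls",
               preProc = c("center", "scale"))
  pred <- predict(fit, economics[testSlices[[i]], ])
  true <- economics$unemploy[testSlices[[i]]]
  rmse[k] <- RMSE(pred, true)  # out-of-sample error for this slice
}

summary(rmse)
```

Looking at how the per-slice error varies over time is often more informative than a single averaged figure.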

As mentioned earlier, this sort of timeSlicing is done by your original function in one step:

> myTimeControl <- trainControl(method = "timeslice",
+                               initialWindow = 36,
+                               horizon = 12,
+                               fixedWindow = TRUE)
>
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+                     data = economics,
+                     method = "pls",
+                     preProc = c("center", "scale"),
+                     trControl = myTimeControl)
> plsFitTime
Partial Least Squares

478 samples
  5 predictors

Pre-processing: centered, scaled
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)

Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...

Resampling results across tuning parameters:

  ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
  1      1080  0.443     796      0.297
  2      1090  0.43      845      0.295

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 1.

Hope this helps!!


Shambho's answer provides a decent example of how to use the caret package with time slices. However, it can be misleading in terms of modelling technique. So, in order not to misguide future readers who want to use the caret package for predictive modelling on time series (and here I do not mean autoregressive models), I want to highlight a few things.

The problem with time-series data is that look-ahead bias is easy to introduce if one is not careful. In this case, the economics data set aligns the data by economic reporting date and not by release date, which is never the case in real-life applications (economic data points have different time stamps). Unemployment data may lag the other indicators by two months in terms of release date, which would then introduce a bias into the model in Shambho's example.

Next, this example is descriptive rather than predictive (forecasting), because the variable we want to forecast (unemploy) is not lagged relative to the predictors. It merely trains a model to best explain the variation in unemployment based on predictor values at the same economic report dates. (Unemployment in this data set is also a non-stationary time series, which creates all sorts of issues in the modelling process.)
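One way to move towards an actual forecasting setup is to pair each unemploy value with predictor values observed h months earlier. Below is a minimal sketch in base R, assuming h = 12 to match the horizon used above. The lag length, the data frame name lagged, and the _lag column names are my own choices for illustration, and this does nothing about the release-date issue.

```r
library(ggplot2)  # provides the economics data set
data(economics)

h <- 12                 # assumed forecast lag in months
n <- nrow(economics)

# Row j pairs unemploy at time j + h with predictors observed at time j
lagged <- data.frame(
  date        = economics$date[(h + 1):n],
  unemploy    = economics$unemploy[(h + 1):n],
  pce_lag     = economics$pce[1:(n - h)],
  pop_lag     = economics$pop[1:(n - h)],
  psavert_lag = economics$psavert[1:(n - h)]
)

head(lagged)
```

A model such as unemploy ~ pce_lag + pop_lag + psavert_lag fitted on this data frame then only uses information that was dated at least h months before the value being predicted.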

Lastly, the 12-month horizon in this example is not true multi-period forecasting as Hyndman does it in his examples.

Hyndman on cross-validation for time-series


Actually, you can!

First, let me give you a scholarly article on the topic.

In R:

Using the caret package, createResample can be used to make simple bootstrap samples, and createFolds can be used to generate balanced cross-validation groupings from a set of data. So you'll probably want to use createResample. Here's an example of its usage:

data(oil)
createDataPartition(oilType, 2)

x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)

plot(density(x[inA]))
rug(x[inA])

points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)

createResample(oilType, 2)

createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)

createFolds(rnorm(21))

createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)

createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

The values you see in the createResample function are the data and the number of partitions to create, in this case 2. You can additionally specify if the results should be stored as a list with list = TRUE or list = FALSE.

Additionally, caret contains a function called createTimeSlices that can create the indices for this type of splitting.

The three parameters for this type of splitting are:

  • initialWindow: the initial number of consecutive values in each training set sample
  • horizon: the number of consecutive values in each test set sample
  • fixedWindow: a logical; if FALSE, the training set always starts at the first sample and the training set size varies over data splits
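The effect of fixedWindow is easiest to see by comparing the two settings on the same toy index. This is a small sketch reusing the 1:9 example from above; the variable names are my own.

```r
library(caret)

fixed   <- createTimeSlices(1:9, initialWindow = 5, horizon = 1, fixedWindow = TRUE)
growing <- createTimeSlices(1:9, initialWindow = 5, horizon = 1, fixedWindow = FALSE)

# Training-set size per slice
fixed_sizes   <- unname(sapply(fixed$train, length))    # 5 5 5 5
growing_sizes <- unname(sapply(growing$train, length))  # 5 6 7 8

# First index of each training slice
fixed_starts   <- unname(sapply(fixed$train, min))      # 1 2 3 4
growing_starts <- unname(sapply(growing$train, min))    # 1 1 1 1

# The test indices are the same in both cases; only the training window differs
unname(unlist(fixed$test))
unname(unlist(growing$test))
```

With fixedWindow = TRUE the training window slides forward at constant size, while with fixedWindow = FALSE it keeps growing from the first sample.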

Usage:

createDataPartition(y,
                    times = 1,
                    p = 0.5,
                    list = TRUE,
                    groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE)

Sources:

http://caret.r-forge.r-project.org/splitting.html

http://eranraviv.com/blog/bootstrapping-time-series-r-code/

http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=caret/man/createDataPartition.Rd&d=R_CC
