C5.0 decision tree - c50 code called exit with value 1
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem, first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string ("") as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] <- "missing"
levels(train$Embarked)[1] <- "missing"
your algorithm will now run without an error.
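Renaming level 1 by position works here because the empty string sorts first. As a slightly more general sketch, the empty level can be located by value instead. The data frame `toy` and the helper `rename_empty_levels` below are made up for illustration and are not part of the C50 package or the Titanic data:

```r
# Toy data frame standing in for the Titanic training set (hypothetical values).
toy <- data.frame(
  Cabin    = factor(c("", "C85", "", "E46")),
  Embarked = factor(c("S", "", "C", "Q")),
  Age      = c(22, 38, 26, 35)
)

# Hypothetical helper: rename any empty-string factor level to a placeholder,
# looking the level up by value rather than assuming it is level 1.
rename_empty_levels <- function(df, placeholder = "missing") {
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      lv <- levels(df[[col]])
      lv[lv == ""] <- placeholder
      levels(df[[col]]) <- lv
    }
  }
  df
}

fixed <- rename_empty_levels(toy)
levels(fixed$Cabin)
levels(fixed$Embarked)
```

Non-factor columns such as Age are left untouched, so the helper can be applied to the whole data frame before calling C5.0.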
Just in case, you can take a look at the error with
summary(new_model)
This error also occurs when a variable name contains special characters. For example, you will get this error if a variable name contains "я" (a letter from the Russian alphabet).
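One way to spot such names before fitting is to look for code points outside the ASCII range. This is only a sketch: `has_non_ascii` and the toy data frame are invented for illustration, and the replacement names are arbitrary.

```r
# Toy data frame with one column name containing a Cyrillic letter (hypothetical).
toy <- data.frame(x = 1:3, y = c(2.5, 3.1, 4.8))
names(toy)[2] <- "\u044f_value"

# Hypothetical helper: TRUE for each string containing a non-ASCII character.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(utf8ToInt(enc2utf8(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

bad <- has_non_ascii(names(toy))
bad_names <- names(toy)[bad]

# Replace the offending names with plain ASCII ones based on column position.
names(toy)[bad] <- paste0("var", which(bad))
names(toy)
```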
Here is what finally worked (I got the idea after reading this post):
library(C50)
test$Survived <- NA
combinedData <- rbind(train, test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2], new_train$Survived)
new_model_predict <- predict(new_model, new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition is that, by combining the two data sets before fixing the levels and then splitting them again, the train and test sets end up with consistent factor levels.
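The effect on factor levels can be seen on a small example that does not need the Kaggle files. The two data frames below are made up; only base R's rbind and row subsetting are used:

```r
# Two toy data sets whose Embarked factors were built separately (hypothetical values).
train_toy <- data.frame(Embarked = factor(c("S", "C", "S")))
test_toy  <- data.frame(Embarked = factor(c("Q", "S")))

# Built separately, the level sets differ.
levels_match_before <- identical(levels(train_toy$Embarked),
                                 levels(test_toy$Embarked))

# rbind() on data frames expands factor levels to cover both inputs.
combined <- rbind(train_toy, test_toy)

# Split back by row position; row subsetting keeps the full level set.
n_train   <- nrow(train_toy)
new_train <- combined[seq_len(n_train), , drop = FALSE]
new_test  <- combined[-seq_len(n_train), , drop = FALSE]

levels_match_after <- identical(levels(new_train$Embarked),
                                levels(new_test$Embarked))
```

Splitting with nrow(train) instead of hard-coded row numbers such as 891 and 1309 is a design choice that keeps the sketch independent of the data set size.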