How to use k-fold cross validation in scikit with a naive bayes classifier and NLTK



NLTK doesn't directly support cross-validation for machine learning algorithms, so your options are either to set this up yourself or to use something like NLTK-Trainer.

I'd probably recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.

Supposing you want 10-fold cross-validation, you would partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and repeat for each of the 10 train/test combinations.

Assuming your training set is in a list named training, a simple way to accomplish this would be,

```python
num_folds = 10
subset_size = len(training) // num_folds  # integer division, so this also works on Python 3
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
```
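To see the fold bookkeeping end to end, here is a self-contained sketch of the same loop run on a toy labelled dataset. The "classifier" is just a stand-in dictionary lookup (not a real NLTK model), so everything is assumed data for illustration; only the slicing pattern is the point.

```python
# Toy labelled data: (features, label) pairs, standing in for real NLTK feature sets.
training = [({'f': i % 2}, 'even' if i % 2 == 0 else 'odd') for i in range(20)]

num_folds = 5
subset_size = len(training) // num_folds  # integer division keeps the slices aligned

accuracies = []
for i in range(num_folds):
    # Held-out fold: one contiguous slice of subset_size items.
    testing_this_round = training[i * subset_size:][:subset_size]
    # Training data: everything before plus everything after the held-out fold.
    training_this_round = training[:i * subset_size] + training[(i + 1) * subset_size:]

    # Stand-in "training": memorise the feature -> label mapping seen in training data.
    model = {feats['f']: label for feats, label in training_this_round}

    # Evaluate against the held-out fold and save the accuracy.
    correct = sum(1 for feats, label in testing_this_round
                  if model.get(feats['f']) == label)
    accuracies.append(correct / float(len(testing_this_round)))

# Find mean accuracy over all rounds.
mean_accuracy = sum(accuracies) / len(accuracies)
```

In a real run you would replace the dictionary with `nltk.NaiveBayesClassifier.train(training_this_round)` and the counting loop with `nltk.classify.util.accuracy`.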


Actually, there is no need for the long loop provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).

Scikit provides cross_val_score, which does all the looping under the hood.

```python
from sklearn.cross_validation import KFold, cross_val_score

k_fold = KFold(len(y), n_folds=10, shuffle=True, random_state=0)
clf = <any classifier>
print cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
```
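Note that `sklearn.cross_validation` was removed in scikit-learn 0.20; the same idea with the current `sklearn.model_selection` API looks like the sketch below. The data here is random toy counts (an assumption for illustration), with `MultinomialNB` as the arbitrary classifier:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy count features; in practice X would come from e.g. a vectorizer.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 10))
y = rng.randint(0, 2, size=100)

# KFold no longer takes the dataset length; pass it to cross_val_score via X, y.
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = MultinomialNB()
scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)  # one accuracy per fold
print(scores.mean())
```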


I've used both libraries: NLTK for Naive Bayes and sklearn for cross-validation, as follows:

```python
import nltk
from sklearn import cross_validation

training_set = nltk.classify.apply_features(extract_features, documents)
cv = cross_validation.KFold(len(training_set), n_folds=10, shuffle=False, random_state=None)
for traincv, testcv in cv:
    # index into the feature set with the fold indices, rather than slicing
    # from the first to the last index (which drops the final item and, for
    # middle folds, would leak test items into the training data)
    classifier = nltk.NaiveBayesClassifier.train([training_set[i] for i in traincv])
    print 'accuracy:', nltk.classify.util.accuracy(classifier, [training_set[i] for i in testcv])
```

and at the end I calculated the average accuracy across the folds.
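That final averaging step is just the mean of the per-fold scores collected in the loop; a minimal sketch, with made-up accuracy values standing in for the printed results:

```python
# Hypothetical per-fold accuracies collected from the loop above.
fold_accuracies = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]

# Average accuracy across all folds.
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```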