
Mixing categorical and continuous data in a Naive Bayes classifier using scikit-learn


You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. scikit-learn can now do this automatically with KBinsDiscretizer, and it is not too complicated to do it yourself either (see the binning sketch after this list). Then fit a single multinomial NB on that fully categorical representation of your data.

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (from the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and refit a new model (e.g. a new Gaussian NB) on those new features (see the stacking sketch below).
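
A minimal sketch of the binning option, with toy data standing in for a real training set (X_num, y, and the other names here are illustrative): it cuts each continuous column at its 20/40/60/80th percentiles so that each of the five bins holds roughly 20% of the samples, one-hot encodes the bin indices, and fits a single MultinomialNB.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import OneHotEncoder

    # Toy continuous data (e.g. heights and weights) and labels.
    rng = np.random.default_rng(0)
    X_num = rng.normal(loc=[170.0, 70.0], scale=[10.0, 8.0], size=(100, 2))
    y = rng.integers(0, 2, size=100)

    # Percentile bin boundaries per column: "very small" .. "very big",
    # each bin covering roughly 20% of the training population.
    edges = [np.percentile(col, [20, 40, 60, 80]) for col in X_num.T]
    X_binned = np.column_stack(
        [np.digitize(col, e) for col, e in zip(X_num.T, edges)])

    # One-hot encode the bin indices so MultinomialNB sees count-like
    # features, then fit a single multinomial NB on the categorical data.
    X_onehot = OneHotEncoder().fit_transform(X_binned)
    clf = MultinomialNB().fit(X_onehot, y)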

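And a sketch of the probability-stacking option, continuing from the block above (it reuses the illustrative X_onehot, X_num, and y):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    # Independently fit one NB model per feature type on its own slice.
    multinomial_nb = MultinomialNB().fit(X_onehot, y)  # categorical part
    gaussian_nb = GaussianNB().fit(X_num, y)           # continuous part

    # Take the class assignment probabilities of each model as new features...
    multinomial_probas = multinomial_nb.predict_proba(X_onehot)
    gaussian_probas = gaussian_nb.predict_proba(X_num)
    X_new = np.hstack((multinomial_probas, gaussian_probas))

    # ...and refit a new model (here another Gaussian NB) on them.
    stacked = GaussianNB().fit(X_new, y)
    y_pred = stacked.predict(X_new)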

The simple answer: multiply the results! It's the same.

Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features. That means you calculate each feature's likelihood without conditioning on the other features, so the algorithm multiplies the probability from one feature by the probability from the next (and we can totally ignore the denominator, since it is just a normalizer).
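
In symbols, for a class y and features x_1, ..., x_n:

    P(y | x_1, ..., x_n)  ∝  P(y) · P(x_1 | y) · P(x_2 | y) · ... · P(x_n | y)

so likelihoods estimated by separate models on separate feature groups can simply be multiplied together.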

So the right answer is:

  1. calculate the probability from the categorical variables,
  2. calculate the probability from the continuous variables, and
  3. multiply the results of 1. and 2. (a log-space sketch follows below).
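
A minimal sketch of those three steps, assuming the data is already split into integer-coded categorical columns X_cat and continuous columns X_num (illustrative names), using scikit-learn's CategoricalNB and GaussianNB. One caveat: each model's predict_proba already multiplies in the class prior P(y), so naively multiplying the two outputs would count the prior twice; adding log-probabilities and subtracting one copy of the log prior corrects this.

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB, GaussianNB

    # Toy stand-ins -- in practice these are your real feature slices and labels.
    rng = np.random.default_rng(0)
    X_cat = rng.integers(0, 3, size=(100, 2))  # integer-coded categorical features
    X_num = rng.normal(size=(100, 2))          # continuous features
    y = rng.integers(0, 2, size=100)

    # 1. probability from the categorical variables
    cat_nb = CategoricalNB().fit(X_cat, y)
    # 2. probability from the continuous variables
    num_nb = GaussianNB().fit(X_num, y)

    # 3. "multiply" = add in log space; subtract one copy of the log prior,
    # because each per-part posterior already includes P(y) once.
    log_post = (cat_nb.predict_log_proba(X_cat)
                + num_nb.predict_log_proba(X_num)
                - np.log(num_nb.class_prior_))

    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    posterior = np.exp(log_post)
    posterior /= posterior.sum(axis=1, keepdims=True)
    y_pred = posterior.argmax(axis=1)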


Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.

https://github.com/remykarem/mixed-naive-bayes

The library is written such that the APIs are similar to scikit-learn's.

In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing the classifier, just specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.

    from mixed_naive_bayes import MixedNB

    X = [[0, 0, 180.9, 75.0],
         [1, 1, 165.2, 61.5],
         [2, 1, 166.3, 60.3],
         [1, 1, 173.0, 68.2],
         [0, 2, 178.4, 71.0]]
    y = [0, 0, 1, 1, 0]

    clf = MixedNB(categorical_features=[0,1])
    clf.fit(X, y)
    clf.predict(X)

Pip-installable via pip install mixed-naive-bayes. More information on usage can be found in the README.md file. Pull requests are greatly appreciated :)