Combining individual probabilities in Naive Bayesian spam filtering


Verifying with just a calculator, it seems OK for the non-spam phrase you posted. In that case $pProducts is a couple of orders of magnitude smaller than $pSums.

Try running some real spam from your spam folder, where you'll meet probabilities like 0.8. And guess why spammers sometimes send a piece of newspaper text in a hidden frame along with the message :)
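
For reference, here is a minimal PHP sketch of the usual combining formula. It assumes $pProducts stands for the product of the individual word probabilities and $pSums for that product plus the product of their complements; the names just mirror the question, and the exact expressions there may differ.

    <?php
    // Combine per-word spam probabilities into one message probability.
    // Assumes: $pProducts = prod(p_i) and $pSums = prod(p_i) + prod(1 - p_i).
    function combineProbabilities(array $wordProbabilities): float
    {
        $productSpam = 1.0;   // prod(p_i)
        $productHam  = 1.0;   // prod(1 - p_i)

        foreach ($wordProbabilities as $p) {
            $productSpam *= $p;
            $productHam  *= 1.0 - $p;
        }

        $pProducts = $productSpam;
        $pSums     = $productSpam + $productHam;

        return $pProducts / $pSums;
    }

    // A hammy phrase (low word probabilities) stays far below 0.5 ...
    echo combineProbabilities([0.01, 0.02, 0.05]), "\n"; // ~1e-5
    // ... while typical spam words around 0.8 push the result towards 1.
    echo combineProbabilities([0.8, 0.8, 0.9]), "\n";    // ~0.99
    ?>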


If your filter is not biased (Pr(S) = Pr(H) = 0.5), then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."

This means you should train your Bayesian filter on similar amounts of spam and ham messages, say 1000 spam messages and 1000 ham messages.

I'd assume (not checked) that if your filter is biased, the learning set should instead conform to your hypothesis about the probability of any message being spam.
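
As a rough sketch of where the prior enters, assuming the standard per-word Bayes formula: with Pr(S) = Pr(H) = 0.5 the prior cancels out, which is why balanced training sets work for an unbiased filter, while a biased prior shifts every word's spamicity.

    <?php
    // Per-word spamicity with an explicit prior.
    // $prWGivenSpam / $prWGivenHam: how often the word appears in spam / ham.
    // $prS: assumed prior probability that any incoming message is spam.
    function wordSpamicity(float $prWGivenSpam, float $prWGivenHam, float $prS = 0.5): float
    {
        $prH = 1.0 - $prS;

        return ($prWGivenSpam * $prS)
             / ($prWGivenSpam * $prS + $prWGivenHam * $prH);
    }

    // Unbiased filter: the priors cancel, only word frequencies matter.
    echo wordSpamicity(0.30, 0.10), "\n";        // 0.75
    // Biased filter (e.g. assuming 80% of mail is spam): same counts,
    // higher spamicity.
    echo wordSpamicity(0.30, 0.10, 0.8), "\n";   // ~0.92
    ?>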


On the idea of compensating for message lengths, you could estimate, for each set, the probability that a word position in a message is a specific word, then use a Poisson distribution to estimate the probability that a message of N words contains that specific word.
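
A rough sketch of that estimate, assuming a per-word rate taken from the training set and a Poisson model for the number of occurrences in an N-word message:

    <?php
    // $rate: estimated probability that a single word position is this
    // specific word (from the spam or ham training set).
    // For a message of $n words the expected count is $n * $rate, and the
    // Poisson probability of seeing the word at least once is 1 - e^{-lambda}.
    function probMessageContainsWord(float $rate, int $n): float
    {
        $lambda = $n * $rate;          // expected occurrences in an n-word message

        return 1.0 - exp(-$lambda);    // P(count >= 1) under Poisson(lambda)
    }

    // A word occurring once per 1000 words is far more likely to show up
    // in a 2000-word message than in a 50-word one.
    echo probMessageContainsWord(0.001, 50), "\n";    // ~0.05
    echo probMessageContainsWord(0.001, 2000), "\n";  // ~0.86
    ?>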