Combining individual probabilities in Naive Bayesian spam filtering


Verifying with just a calculator, it seems OK for the non-spam phrase you posted. In that case $pProducts is a couple of orders of magnitude smaller than $pSums.

Try running some real spam from your spam folder, where you'll meet probabilities like 0.8. And guess why spammers sometimes send a piece of newspaper text in a hidden frame along with the message :)
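
For reference, here is a minimal PHP sketch of the usual combining formula. It assumes $pProducts stands for the product of the individual word probabilities and $pSums for that product plus the product of their complements; the names just mirror the question, and the exact expressions there may differ.

    <?php
    // Combine per-word spam probabilities into one message probability.
    // Assumes: $pProducts = prod(p_i) and $pSums = prod(p_i) + prod(1 - p_i).
    function combineProbabilities(array $wordProbabilities): float
    {
        $productSpam = 1.0;   // prod(p_i)
        $productHam  = 1.0;   // prod(1 - p_i)

        foreach ($wordProbabilities as $p) {
            $productSpam *= $p;
            $productHam  *= 1.0 - $p;
        }

        $pProducts = $productSpam;
        $pSums     = $productSpam + $productHam;

        return $pProducts / $pSums;
    }

    // A hammy phrase (low word probabilities) stays far below 0.5 ...
    echo combineProbabilities([0.01, 0.02, 0.05]), "\n"; // ~1e-5
    // ... while typical spam words around 0.8 push the result towards 1.
    echo combineProbabilities([0.8, 0.8, 0.9]), "\n";    // ~0.99
    ?>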


If your filter is not biased (Pr(S) = Pr(H) = 0.5), then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."

This means you should train your Bayesian filter on similar amounts of spam and ham messages, say 1000 spam messages and 1000 ham messages.

I'd assume (not checked) that if your filter is biased, the learning set should instead conform to your hypothesis about the probability of any message being spam.
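
As a rough sketch of where the prior enters, assuming the standard per-word Bayes formula: with Pr(S) = Pr(H) = 0.5 the prior cancels out, which is why balanced training sets work for an unbiased filter, while a biased prior shifts every word's spamicity.

    <?php
    // Per-word spamicity with an explicit prior.
    // $prWGivenSpam / $prWGivenHam: how often the word appears in spam / ham.
    // $prS: assumed prior probability that any incoming message is spam.
    function wordSpamicity(float $prWGivenSpam, float $prWGivenHam, float $prS = 0.5): float
    {
        $prH = 1.0 - $prS;

        return ($prWGivenSpam * $prS)
             / ($prWGivenSpam * $prS + $prWGivenHam * $prH);
    }

    // Unbiased filter: the priors cancel, only word frequencies matter.
    echo wordSpamicity(0.30, 0.10), "\n";        // 0.75
    // Biased filter (e.g. assuming 80% of mail is spam): same counts,
    // higher spamicity.
    echo wordSpamicity(0.30, 0.10, 0.8), "\n";   // ~0.92
    ?>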


On the idea of compensating for message lengths, you could estimate, for each set, the probability that a word position in a message is a specific word, then use a Poisson distribution to estimate the probability that a message of N words contains that specific word.
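
A rough sketch of that estimate, assuming a per-word rate taken from the training set and a Poisson model for the number of occurrences in an N-word message:

    <?php
    // $rate: estimated probability that a single word position is this
    // specific word (from the spam or ham training set).
    // For a message of $n words the expected count is $n * $rate, and the
    // Poisson probability of seeing the word at least once is 1 - e^{-lambda}.
    function probMessageContainsWord(float $rate, int $n): float
    {
        $lambda = $n * $rate;          // expected occurrences in an n-word message

        return 1.0 - exp(-$lambda);    // P(count >= 1) under Poisson(lambda)
    }

    // A word occurring once per 1000 words is far more likely to show up
    // in a 2000-word message than in a 50-word one.
    echo probMessageContainsWord(0.001, 50), "\n";    // ~0.05
    echo probMessageContainsWord(0.001, 2000), "\n";  // ~0.86
    ?>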