


Bayes’ Theorem in Spam Filtering

The idea behind Bayes’ Theorem, as we saw in class, is quite simple: update your expectations based on any new information that you receive. Applied wisely, however, this simple idea can be used to great effect. One application of the theorem that all of us benefit from (sometimes unknowingly) is spam filtering. Bayes’ Theorem forms the mathematical foundation of the Naive Bayes spam filtering technique, which is widely used by email services to keep spam out of our inboxes.

Spam filters face a binary decision every time an email is received: classify the email as spam, or as not spam. Unfortunately, this decision cannot be made by a fixed, deterministic set of rules. For one thing, the spam that a person in the US receives would generally not resemble the typical spam received by someone in, say, China. For another, if the filter were deterministic and static, a spammer could find a way past it once and then keep exploiting that hole, rendering the filter useless. Thus, spam filters need to, firstly, adapt to every client and, secondly, evolve over time. This is a problem perfectly suited to Bayes’ Theorem: as the filter learns more about what can and cannot be safely classified as spam, it can make smarter decisions about future incoming emails.

As far as implementation goes, there are several ways of actually building a Naive Bayes spam filter. The underlying technique, though, remains the same. The filter is first ‘trained’ on emails that have been pre-classified (by a human) as spam or not spam. These emails form the ‘training set’ of the filter. Then, whenever an email comes in, the filter looks at the contents of the email to decide whether it is spam or not. In particular, it looks at every word, and uses Bayes’ Theorem to calculate the probability of the email being spam, given that it contains that particular word. To apply Bayes’ Theorem, it estimates the probabilities it needs (how common spam is overall, and how often the word appears in spam and in non-spam emails) from the training set data. This calculation is carried out for each word in the email, the per-word results are combined by treating the words as independent pieces of evidence (this is the ‘naive’ part), and the email is classified into one of the two categories. Once a decision has been made on this email, the filter adds the email to the training set, so that this decision can be used next time to yield an even better filtering result. (A slightly simplified mathematical explanation can be found here.)
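The steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular email service implements its filter: the class name `NaiveBayesSpamFilter`, the labels `"spam"` and `"ham"`, the whitespace tokenizer, the add-one smoothing and the 0.5 threshold are all assumptions made for the example. Only the words that actually appear in an email are used as evidence.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical minimal filter; 'ham' means 'not spam'."""

    def __init__(self):
        # For each label: how many training emails contain each word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # For each label: how many training emails there are.
        self.email_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one classified email to the training set."""
        self.email_counts[label] += 1
        self.word_counts[label].update(set(tokenize(text)))

    def word_likelihood(self, word, label):
        """Estimate P(word appears | label) with add-one smoothing,
        so that a word never seen in training does not produce a zero."""
        return (self.word_counts[label][word] + 1) / (self.email_counts[label] + 2)

    def spam_probability(self, text):
        """Estimate P(spam | words in the email) via Bayes' Theorem,
        treating the words as independent (the 'naive' assumption)."""
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log(
            (self.email_counts["spam"] + 1) / (self.email_counts["ham"] + 1)
        )
        # Each word present in the email multiplies the odds by its likelihood ratio.
        for word in set(tokenize(text)):
            log_odds += math.log(self.word_likelihood(word, "spam"))
            log_odds -= math.log(self.word_likelihood(word, "ham"))
        # Convert log-odds to a probability in a numerically stable way.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)

    def classify(self, text, threshold=0.5, learn=True):
        """Classify an email and, if learn is True, add the decision to the training set."""
        label = "spam" if self.spam_probability(text) > threshold else "ham"
        if learn:
            self.train(text, label)
        return label
```

To try it, one would call `train` on a handful of hand-labelled emails and then call `classify` on new ones. Working in log-odds rather than multiplying raw probabilities is a design choice that avoids numerical underflow on long emails.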

Of course, it’s quite possible that the filter wrongly classifies an email. However, the likelihood of this happening has been shown to be quite low, as long as the filter is implemented correctly and the initial training set is comprehensive enough. The biggest advantage of the Naive Bayes spam filter is that it evolves over time. If we explicitly mark some email as spam, the change is reflected in the training set, so every previous decision helps the filter make smarter decisions about subsequent emails. Additionally, since the classifier learns from each individual’s inbox, it is highly customized to the user’s needs. This is important because the same email might be spam for one person, but not for another. These advantages explain why the Naive Bayes filter is so widely used. Its success rate is very high, and it’s fascinating to think that it originates from such a simple, yet logical theorem.
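To see how a user’s feedback shifts the filter’s beliefs, here is a small worked example of Bayes’ Theorem for a single word. The function name `p_spam_given_word` and all of the counts are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes the word has appeared in at least one training email.

```python
def p_spam_given_word(spam_with_word, spam_total, ham_with_word, ham_total):
    """Bayes' Theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)."""
    p_spam = spam_total / (spam_total + ham_total)   # prior P(spam)
    p_ham = 1 - p_spam                               # prior P(not spam)
    p_word_given_spam = spam_with_word / spam_total  # P(word | spam)
    p_word_given_ham = ham_with_word / ham_total     # P(word | not spam)
    # Total probability of seeing the word in any email.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical inbox: "offer" appears in 3 of 40 spam emails and 3 of 60 other emails.
before = p_spam_given_word(3, 40, 3, 60)

# The user then marks 5 more emails containing "offer" as spam,
# so the training set now has 8 of 45 spam emails containing the word.
after = p_spam_given_word(8, 45, 3, 60)

print(f"P(spam | 'offer') before feedback: {before:.2f}")
print(f"P(spam | 'offer') after feedback:  {after:.2f}")
```

With these made-up counts the word starts out as neutral evidence (a probability of one half) and, after the user’s feedback, counts as evidence for spam. A different user with different counts would end up with a different value, which is what makes the filter personal.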

 

Sources:

http://www.cad.zju.edu.cn/home/zhx/csmath/lib/exe/fetch.php?media=2010:ml2010-2-3-naive_bayes_classification.pdf

http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf

http://www.paulgraham.com/spam.html

http://people.math.gatech.edu/~ecroot/bayesian_filtering.pdf
