Thomas Bayes, Clean My Inbox

http://www.paulgraham.com/better.html

The paper linked above explains how Bayes' Rule is applied in algorithms that try to catch spam in email. Bayes' Rule uses conditional probability to give the probability of event A occurring given that event B has already occurred. For example, if I were to pick cards from a deck, I could ask: what is the probability of drawing a red card given that a red card has already been drawn? Conditional probability applies here because the probability of drawing a red card given that one has already been drawn (and not put back) is not the same as the probability of simply drawing a red card.
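To make the card example concrete, here is a small sketch. It assumes a standard 52-card deck with 26 red cards and that the first card is not put back; neither detail is spelled out above, and `prob_red` is just a name made up for the illustration.

```python
from fractions import Fraction

def prob_red(red_cards: int, total_cards: int) -> Fraction:
    """Probability of drawing a red card from a deck with the given counts."""
    return Fraction(red_cards, total_cards)

# Assumption: standard 52-card deck, 26 red cards, drawing without replacement.
p_red = prob_red(26, 52)            # no prior information
p_red_given_red = prob_red(25, 51)  # one red card has already been removed

print(p_red)            # 1/2
print(p_red_given_red)  # 25/51
```

The two numbers differ, which is the whole point: knowing that a red card is already gone changes the odds for the next draw.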

However, while some events, like the one I just described, are not difficult to set up, others are more complicated. Take spam email, for example. I could ask: what is the probability that a given email is spam, given that it contains a certain word? This is represented as such:
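Written out from the reading given just below, with S standing for "the email is spam" and W for "the email contains the word" (these symbols are shorthand chosen here, not taken from the original figure), the equation would be:

```latex
\[
P(S \mid W)
  = \frac{P(W \mid S)\,P(S)}{P(W)}
  = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid \neg S)\,P(\neg S)}
\]
```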

This uses Bayes' Rule to calculate a probability based on things we can observe. It reads as follows: the probability of an email being spam given that it has a certain word is equal to the probability that it has the word given that the email is spam, times the probability of spam, divided by the sum of two terms: the probability of the word given spam times the probability of spam, plus the probability of the word given not spam times the probability of not spam.
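As a rough sketch, the reading above translates almost word for word into a small function. The function name, the parameter names, and the input numbers are all invented for illustration; they do not come from the paper.

```python
def prob_spam_given_word(p_word_given_spam: float,
                         p_word_given_not_spam: float,
                         p_spam: float) -> float:
    """Bayes' Rule with the denominator expanded over spam / not spam."""
    p_not_spam = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_not_spam * p_not_spam
    if denominator == 0:
        raise ValueError("the word never appears, so the probability is undefined")
    return numerator / denominator

# Made-up inputs: the word appears in 30% of spam and 2% of legitimate mail,
# and half of all mail is spam.
p = prob_spam_given_word(0.30, 0.02, 0.5)
print(round(p, 4))  # 0.9375
```

With these made-up inputs, a word that is fifteen times more common in spam than in legitimate mail pushes the spam probability to about 94%.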

Because we do not know the overall probability of the word directly, we have to expand the denominator into the form on the right. This is what the attached paper uses. There, the author looks at the probability of an email being spam given that it contains the word “free,” along with variants of it, such as:

Subject*Free!!!
Subject*free!!!
Subject*FREE!
Subject*Free!
Subject*free!
Subject*FREE
Subject*Free
Subject*free
FREE!!!
Free!!!
free!!!
FREE!
Free!
free!
FREE
Free
free
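The list above appears to be every combination of an optional "Subject*" prefix, three letter cases, and zero, one, or three trailing exclamation marks, minus the single token Subject*FREE!!!. That reading is an inference from the shape of the list, not something stated here, and `token_variants` is a hypothetical helper written only to show the pattern.

```python
from itertools import product

def token_variants(word: str) -> list[str]:
    """Cross product of subject prefix, trailing '!' count, and letter case."""
    prefixes = ["Subject*", ""]
    endings = ["!!!", "!", ""]
    cases = [word.upper(), word.capitalize(), word.lower()]
    # product() varies the last argument fastest, so letter case changes first.
    return [prefix + cased + ending
            for prefix, ending, cased in product(prefixes, endings, cases)]

variants = token_variants("free")
print(len(variants))  # 18
print(variants[0])    # Subject*FREE!!!
print(variants[-1])   # free
```

Dropping the first generated token leaves the seventeen entries listed above, in the same order.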

The article talks about how spam filters of this kind, known as Bayesian spam filters, use this rule to get at a probability they cannot measure directly from quantities they can. We cannot directly measure the probability that a message is spam given that it contains a certain word, so instead we measure the probability of the word given that the message is spam, and likewise given that it is not. This roundabout way of measuring what cannot be measured directly, through things that can, is a highlight of Bayes' theorem, and part of why our email inboxes are not flooded with spam, or at least not to the degree they could be. Thanks, Tom Bayes!

-Royce

