Bayes Rule, Machine Learning, & Fighting Spam

Bayes' Theorem, one of the most fundamental results in probability theory, has found its way into an incredible variety of fields. One of its best-known applications is a family of machine learning techniques called Bayesian classifiers, or Naïve Bayes classifiers, which are used to classify entities. A common driving example of this technique is the classification of an email as spam by spam filters. Even though the method is based on the relatively straightforward Bayes' Theorem, it is effective enough that it is still used, in some form, by modern spam filters.

Say we have an email that consists of $n$ words, $w_1, w_2, \ldots, w_n$. We first define the probability $P(w_i)$ as the probability that word $w_i$ appears in any given email. These words are often called "features" in the context of Bayesian classifiers, as they describe the entity we're classifying. Next, we define a random variable $S$ that equals 1 when an email is spam and 0 when it's not. Since our goal is to determine whether the email is spam, we would like to know whether the following probability is higher when $S = 1$ or $S = 0$:

$$P(S \mid w_1, w_2, \ldots, w_n)$$
This formula is the conditional probability of an email being spam given that it consists of the words $w_1, \ldots, w_n$. In machine learning tasks like these, we often have a set of documents called a "training set" (in this case a set of emails) that are pre-labeled (in this case marked as spam or not), which we use to "train" our algorithm. Here, we could go through all the emails in the training set, find the ones that consist of exactly the words $w_1, \ldots, w_n$, and then calculate the percentage of them that are labeled spam. This would be our best estimate of the probability above. However, this approach isn't feasible: since emails are generally long and unique, we wouldn't be able to find enough emails consisting of the words $w_1, \ldots, w_n$ to calculate a good estimate. In most cases we wouldn't find any matching emails in our training set at all, as the odds of an identical email being sent multiple times are very low.

This is where Bayes' Theorem comes in. Using it, we can rewrite the probability above as:

$$P(S \mid w_1, \ldots, w_n) = \frac{P(S)\, P(w_1, \ldots, w_n \mid S)}{P(w_1, \ldots, w_n)}$$
In practice, we're interested in whether the probability above is higher for $S = 1$ or $S = 0$. Since the denominator of the fraction doesn't contain $S$, and is therefore identical in both cases, we can ignore it. The only remaining portion is the numerator: the product of $P(S)$, the probability that an email is spam (in the case $S = 1$), and $P(w_1, \ldots, w_n \mid S)$, the conditional probability that a spam email consists of the words $w_1, \ldots, w_n$. Now, however, we run into the same issue as before: we won't be able to find enough emails consisting of the words $w_1, \ldots, w_n$ to calculate a meaningful estimate of the second term. However, if we assume that the probability of a word appearing in an email, $P(w_i)$, is independent of the other words in the email, probability theory allows us to simplify the expression to the following:

$$P(S \mid w_1, \ldots, w_n) \propto P(S) \prod_{i=1}^{n} P(w_i \mid S)$$
Obviously this assumption isn’t true; if we have the word “woke” in an email, the probability of the word “up” also appearing is higher than average. The fact that the assumption is false is why this method is called a “naïve” classifier. However, despite the assumption not always holding, this method still yields useful results.
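To see how the factored formula is used in practice, here is a worked example with made-up numbers (they are purely illustrative, not taken from any real training set). Suppose an email contains only the words "free" and "money", and our training data gives $P(S=1) = 0.4$, $P(\text{free} \mid S=1) = 0.2$, and $P(\text{money} \mid S=1) = 0.1$, while $P(S=0) = 0.6$, $P(\text{free} \mid S=0) = 0.01$, and $P(\text{money} \mid S=0) = 0.005$. Then the spam score is $0.4 \times 0.2 \times 0.1 = 0.008$ and the non-spam score is $0.6 \times 0.01 \times 0.005 = 0.00003$, so the filter would classify the email as spam.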

The new form we have arrived at finally contains probabilities that we can easily estimate from our training set, assuming we have a large set of labeled emails. We first consider the case $S = 1$. The probability $P(S = 1)$ is simply the probability that a random email is spam; we can estimate it as the percentage of emails in our training set that are labeled spam. The second term is the product of the probabilities $P(w_i \mid S = 1)$ for each of the words in the email in question; we can estimate each one as the percentage of spam emails in our training data that contain the word $w_i$. We then perform the same process for $S = 0$. Once we have these values, we can calculate the score for both cases of $S$ and decide whether the email is spam based on which is higher.
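To make the estimation and classification steps concrete, here is a minimal Python sketch of the procedure just described. The data format (a list of `(words, is_spam)` pairs) and all function names are assumptions made for illustration, not part of any particular spam filter, and it assumes the training set contains at least one email of each class.

```python
from collections import Counter

def train(training_set):
    """Estimate P(S = 1) and the per-word counts needed for P(w | S).

    training_set is assumed to be a list of (words, is_spam) pairs, where
    words is a list of strings and is_spam is a boolean.
    """
    spam_emails = [words for words, is_spam in training_set if is_spam]
    ham_emails = [words for words, is_spam in training_set if not is_spam]

    # P(S = 1): the fraction of training emails labeled spam.
    p_spam = len(spam_emails) / len(training_set)

    # For each word, count how many spam (resp. non-spam) emails contain it.
    spam_counts = Counter(w for words in spam_emails for w in set(words))
    ham_counts = Counter(w for words in ham_emails for w in set(words))

    return p_spam, spam_counts, ham_counts, len(spam_emails), len(ham_emails)


def score(words, p_class, word_counts, n_class_emails):
    """P(S = s) times the product of P(w_i | S = s) over the email's words."""
    p = p_class
    for w in words:
        # Fraction of emails in this class that contain the word w.
        p *= word_counts[w] / n_class_emails
    return p


def classify(words, model):
    """Return True if the spam score beats the non-spam score."""
    p_spam, spam_counts, ham_counts, n_spam, n_ham = model
    return (score(words, p_spam, spam_counts, n_spam)
            > score(words, 1 - p_spam, ham_counts, n_ham))
```

In this sketch, `train` would be run once over the labeled emails and `classify` called on each incoming email's word list. Note that it deliberately ignores the two problems discussed next: the product of many small probabilities can underflow, and a word never seen in a class zeroes out that class's score entirely.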

There are a few issues that need to be accounted for, such as the fact that the product of many small probabilities can become too small for floating-point arithmetic to handle accurately, or the fact that a word that never appears in the training data for a class gets an estimated probability of 0 and therefore zeroes out the whole product. However, these issues are easy to fix in practice, as sketched below. In just a few paragraphs, we were able to take a simple theorem from class and slightly modify it to create a method for fighting spam that is still used by commercial spam filters today!
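As a concrete illustration of those two fixes, the sketch below (again with assumed names, building on the hypothetical `train` output from the previous example) sums log-probabilities instead of multiplying raw ones, so the result stays in a range floating-point arithmetic handles well, and applies add-one (Laplace) smoothing so that a word absent from the training data contributes a small nonzero probability instead of zeroing out the score. Add-one smoothing is one common choice, not the only one.

```python
import math

def smoothed_log_score(words, p_class, word_counts, n_class_emails, vocab_size):
    """log P(S = s) plus the sum of log P(w_i | S = s), with add-one smoothing.

    vocab_size is assumed to be the number of distinct words seen anywhere
    in the training data.
    """
    log_p = math.log(p_class)
    for w in words:
        # Add-one (Laplace) smoothing: pretend each word was seen once more
        # than it actually was, so unseen words get a small nonzero probability.
        p_word = (word_counts[w] + 1) / (n_class_emails + vocab_size)
        log_p += math.log(p_word)
    return log_p


def classify(words, model, vocab_size):
    p_spam, spam_counts, ham_counts, n_spam, n_ham = model
    spam = smoothed_log_score(words, p_spam, spam_counts, n_spam, vocab_size)
    ham = smoothed_log_score(words, 1 - p_spam, ham_counts, n_ham, vocab_size)
    # log is monotonic, so comparing log-scores gives the same decision as
    # comparing the original products would.
    return spam > ham
```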

