Bayes’ Theorem in email spam filtering
Bayes’ Theorem describes the conditional probability an event is going to happen based on the prior knowledge of conditions related to this event. It can be used in drug testing. It can also be used to predict people’s probability of having some kind of disease according to their age or gender. However, one of the most important usages of this theorem in real life is to predict whether an email is a spam or not.
The Bayes’ Theorem is first introduced to detect junk email by Microsoft in 1998 with a system called Bayesian filter. Mainly, it makes its judgment according to the contents and the title of the email received by a user. The system was first set up by a list of potential words usually used by junk email, like “Deals”, “Secret”, and “Save”. Then it was further updated constantly by the users’ reports of spam emails. Since not every time a word in the list mentioned above appears in the email makes this email a spam, the probability of this email is a spam can also be calculated by the Bayes’ Rule. If a word appears in a user’s email a lot, then the probability of this email is a spam is relatively low, while if a word seldom appears in a user’s email, then it will have a higher probability of being a junk email.
The general formula for Bayes’ Rule is:
In this situation, the formula can be written as:
For instance, the probability of the word “FREE” appears in an email is 20%, the probability of an email being a spam is 25%, and the probability of a junk email has the word “FREE” is 45%. Then, when an email contains the word “FREE” was received by a user than the system will calculate the probability of this email is a spam according to the Bayes’ theorem is 56%. It is hard to say that this probability is very high. At the same time, the cost of classifying a legitimate email into spam is far larger than classifying a junk email into legitimate. So the system might not ignore this email. However, as the amount of data becomes larger, the accuracy will also be improved. According to another study, only when the probability is as high as 99%, they will make the decision and filter this email.
However, this formula only indicates the probability of an email being a spam based on a single word appears in the email. Many other indicators, like the domain type of the sender (.edu or .org), or whether it has an attachment or not, should also be taken into consideration in real email spam filtering.
As this graph below shows, the accuracy of taking all domain, words, and phrases into consideration is much larger than only filtering based on certain words. Therefore, the real-life situation is still far more complicated than just a theorem, but this formula did provide us with a way to explore how the email filtering system works.
link to the article: https://www.bayestheorem.net/real-life-uses-spam-filtering/
link to a study on this topic: https://pdfs.semanticscholar.org/b449/8c71651f0327b5d51c8f8008d5a1804a084a.pdf