


Naive Bayes Classifiers: an extension of Bayes’ Rule

As one of the most important theorems in probability and statistics, Bayes’ Rule rears its head in many different fields. Classical applications of the theorem include medical tests and spam email detection. It elegantly shows how a probability can be easily updated as we learn more about the events it describes.

 

This proves important in data science and machine learning. Probability, data science, and machine learning are all intertwined. We are able to make quantifiable inferences — in other words, do data science — because of statistics and probability theory. Machine learning is a natural extension of data science; a computer can create prediction models procedurally with large amounts of data, far larger than we humans could ever work with. And of course, these computers can do this because we know the math works. From image recognition to automated medical testing, the results of machine learning are clearly evident — it scales the simple arithmetic we’ve seen in our class to every facet of our lives.

 

One direct usage of Bayes’ Theorem involves the problem of classification — given a large data set of apartments, can we label a data point as having (or not having) some attribute? For instance, how can we tell apart a New York City apartment from a San Francisco apartment? Which attributes (e.g., price, elevation, square footage) would be important for differentiating between the two? And how would we find these features?

 

Perhaps a better way to look at the question is in terms of probability. Given a certain set of attributes that we look at, what is the probability of an NYC apartment having this set? Or an SF apartment? 

 

Let’s call F the event that an apartment fits a certain set of features we want to look at (e.g., on the third floor or higher, rent over $2,000/month), and L the event that it carries a given label (e.g., “NYC apartment”). We can easily formulate this with Bayes’ Rule as:

$$P(L \mid F) = \frac{P(F \mid L)\,P(L)}{P(F)}$$

This arithmetic isn’t hard in nature, of course. That’s why Bayes’ Rule is so cool — it says a lot, yet is computationally easy. We could find P(F) and P(L) pretty easily from our dataset, but what about P(F | L)? That estimate won’t be very reliable, because we can’t be sure our dataset reflects the entire set of NYC and SF apartments, and the exact combination of features in F may only appear a handful of times for any given label.
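To make the arithmetic concrete, here is a minimal Python sketch on a made-up toy dataset (the apartments, the feature set F, and all of the numbers are invented for illustration) that estimates P(L), P(F), and P(F | L) by counting and then applies Bayes’ Rule:

```python
# Toy illustration of Bayes' Rule with invented data.
# Each record is (label, fits_F), where F might be the feature set
# "on the third floor or higher AND rent > $2,000/month".
data = [
    ("NYC", True), ("NYC", True), ("NYC", True), ("NYC", False),
    ("SF", True), ("SF", False), ("SF", False), ("SF", False),
]

n = len(data)
nyc = [fits for label, fits in data if label == "NYC"]

p_L = len(nyc) / n                       # P(L): fraction labeled NYC
p_F = sum(fits for _, fits in data) / n  # P(F): fraction fitting the feature set
p_F_given_L = sum(nyc) / len(nyc)        # P(F | L): NYC apartments fitting F

# Bayes' Rule: P(L | F) = P(F | L) * P(L) / P(F)
p_L_given_F = p_F_given_L * p_L / p_F
print(p_L_given_F)  # 0.75 for this toy data
```

Notice that the P(F | L) count only uses the rows matching the full feature combination; with many features and a modest dataset, that count gets sparse very quickly.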

 

So instead, we can simplify our calculations with one big assumption — we treat each feature as independent of the others, given the label. This wouldn’t necessarily be true (e.g., higher apartments might be worth more because of better views), but the assumption tends to hold well enough. Importantly, it lets us estimate the probability of each feature on its own given a label, and simply multiply these probabilities (together with the label’s overall probability) to get something proportional to the probability of the label given all the attributes at once. For instance, given attributes X1, X2, …, Xn, we can model our probability of a label L as:

$$P(L \mid X_1, X_2, \ldots, X_n) \propto P(L) \prod_{i=1}^{n} P(X_i \mid L)$$

(where the denominator $P(X_1, X_2, \ldots, X_n)$ is dropped for simplicity, since it is the same for every label).
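As a rough sketch of what “multiply the per-feature probabilities” looks like in code (again with invented feature names and counts, and with no smoothing for unseen values), a hand-rolled version might be:

```python
from collections import Counter, defaultdict

# Invented training records: (label, {feature: value}).
train = [
    ("NYC", {"high_floor": True,  "rent_over_2000": True}),
    ("NYC", {"high_floor": True,  "rent_over_2000": True}),
    ("NYC", {"high_floor": False, "rent_over_2000": True}),
    ("SF",  {"high_floor": True,  "rent_over_2000": False}),
    ("SF",  {"high_floor": False, "rent_over_2000": False}),
    ("SF",  {"high_floor": False, "rent_over_2000": True}),
]

label_counts = Counter(label for label, _ in train)
# counts[label][feature][value] = how many training rows match
counts = defaultdict(lambda: defaultdict(Counter))
for label, feats in train:
    for feat, value in feats.items():
        counts[label][feat][value] += 1

def score(label, feats):
    """P(L) * product of P(X_i | L), each estimated by counting."""
    p = label_counts[label] / len(train)                       # P(L)
    for feat, value in feats.items():
        p *= counts[label][feat][value] / label_counts[label]  # P(X_i | L)
    return p

apartment = {"high_floor": True, "rent_over_2000": True}
scores = {label: score(label, apartment) for label in label_counts}
print(scores)                       # {'NYC': 0.333..., 'SF': 0.0555...}
print(max(scores, key=scores.get))  # 'NYC' wins on this toy data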

 

We then compute this score for whatever combination of attributes we want, for every possible label, and pick the label with the highest score! This is called a naive Bayes classifier. It’s a simple model that isn’t as computationally expensive as other classifiers, and yet is fairly effective.
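In practice we rarely need to hand-roll this: scikit-learn (one of the sources linked below) ships several ready-made naive Bayes variants. A minimal sketch with invented numeric features (rent in dollars per month and elevation in meters) might look like:

```python
from sklearn.naive_bayes import GaussianNB

# Invented apartments: [rent in $/month, elevation in meters].
X = [[3500, 5], [3000, 10], [2800, 2],    # made-up NYC listings
     [3200, 60], [2600, 80], [2900, 45]]  # made-up SF listings
y = ["NYC", "NYC", "NYC", "SF", "SF", "SF"]

model = GaussianNB().fit(X, y)            # Gaussian naive Bayes
print(model.predict([[3100, 70]]))        # likely 'SF' given the toy data
print(model.predict_proba([[3100, 70]]))  # per-label probabilities
```

GaussianNB models each numeric feature as a normal distribution within each label, which is one common way of getting the per-feature P(Xi | L) terms.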

 

We can also see the naive Bayes classifier applied elsewhere. For instance, we can think of a large set of emails in our database, a label for “spam” emails, and a bunch of attributes (say, which words appear in the message) that might indicate whether or not an email is spam. The same goes for medical tests and any other traditional example of Bayes’ Theorem!
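For the spam example, a quick sketch of the same idea (with a handful of invented email snippets, and word counts standing in for the attributes) could use scikit-learn’s multinomial variant:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy emails and labels.
emails = [
    "win a free prize now",          # spam
    "claim your free money today",   # spam
    "meeting notes attached",        # not spam
    "lunch with the team tomorrow",  # not spam
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # word-count features
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["claim your free prize"])
print(model.predict(test))             # likely 'spam' on this toy data
```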

 

Sources:

https://www.upgrad.com/blog/bayes-theorem-in-machine-learning/ 

https://machinelearningmastery.com/bayes-theorem-for-machine-learning/

https://scikit-learn.org/stable/modules/naive_bayes.html 

 
