## Classifying Movie Reviews using Bayes

Original Article: https://www.dataquest.io/blog/naive-bayes-movies/

This article describes an algorithm for classifying movie reviews as either positive or negative. Given a training set of text reviews that have been labeled as positive or negative, the ‘Naive Bayes’ classification algorithm estimates word probabilities that can then be used to predict whether other reviews are positive or negative. One possible application is on consumer review sites, where a rating could be inferred from the written review. The article relates to ‘sentiment analysis’, the study of ‘subjective information’ (emotions, feelings) within a piece of text.

Naive Bayes is a simple way of classifying items using Bayes' Theorem. For reviews, the two events are whether the review is positive or negative (pos or ~pos), and whether a certain word occurs within the review (word). From the training set, we can compute the probability of positive and negative reviews (P(pos) and P(~pos)), the probability of each word occurring (P(word) for all words), and the probability of each word occurring given that the review is positive or negative (P(word | pos) or P(word | ~pos)). This is all the information needed to classify other reviews: for any word, the probability of a review being positive given that the word occurs within it is P(pos | word) = P(word | pos) * P(pos) / P(word). To combine the words in a review, Naive Bayes makes the ‘naive’ assumption that words occur independently of one another given the class, so the P(word | pos) values for every word in the review are multiplied together with P(pos), and likewise for ~pos. A word such as “the”, although high frequency, is about equally likely in both kinds of review (its P(pos | word) is probably close to 50%), so it barely shifts the result. The review is assigned to whichever class receives the higher final value.
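
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the article's code: the toy reviews, the function names (`train`, `log_score`, `classify`), and the add-one smoothing are all assumptions made for the example. It combines words in the standard Naive Bayes way, multiplying P(word | class) for each word with the class prior, and it does so in log space to avoid very small numbers. P(word) is left out because it is the same for both classes and does not change which one scores higher.

```python
import math
from collections import Counter

# Hypothetical toy training set; the reviews and labels are made up for illustration.
TRAIN = [
    ("a great fun movie", "pos"),
    ("great acting and a great plot", "pos"),
    ("a boring dull movie", "neg"),
    ("dull plot and bad acting", "neg"),
]


def train(examples):
    """Count reviews per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def log_score(text, label, class_counts, word_counts, vocab):
    """Return log P(label) plus the sum of log P(word | label) over the text."""
    total_reviews = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_reviews)
    for word in text.lower().split():
        # Add-one smoothing (an assumption, not from the text) keeps a word
        # that was never seen in this class from zeroing out the product.
        p_word = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p_word)
    return score


def classify(text, model):
    """Pick the class with the highest log score."""
    class_counts, word_counts, vocab = model
    return max(
        class_counts,
        key=lambda label: log_score(text, label, class_counts, word_counts, vocab),
    )


model = train(TRAIN)
print(classify("a great plot", model))    # pos
print(classify("bad and boring", model))  # neg
```

With only four training reviews the result is driven by a handful of words, so this is useful for following the arithmetic rather than as a realistic classifier.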