Using Bayes’ Theorem to Classify Spam

The article discusses how spam is classified, specifically the method called Bayesian filtering. When a spam filter classifies a piece of mail, it knows that the mail is either spam or not spam, and it calculates the probability of each. Every incoming piece of mail contains certain words, and each word has some probability of occurring in spam emails and some probability of occurring in legitimate ones. Because the filter knows how likely each word is to appear in spam, it can combine these probabilities to calculate the probability that the mail as a whole is spam. The filter then updates these word probabilities with each new piece of mail it receives.

This relates to Bayes’ theorem as covered in class, where we learned that you can find the probability of one event occurring given that another event has occurred. This makes it useful for classifying spam: given that a specific word or other piece of data occurs in a piece of mail, you can calculate the probability that the mail is spam, and then update that prediction as more evidence arrives. Bayes’ rule states that P(A|B) = [P(B|A) * P(A)] / P(B). If A is the event that a random piece of mail is spam, and B is the event that the word occurs in it, then P(A|B), the probability that the mail is spam given that the word occurs, is exactly what the spam filter wants. If this calculation is carried out for each word, and also for other pieces of data such as headers and HTML code, the spam filter can estimate this probability accurately while updating its beliefs with each new message. A filter built on Bayes’ rule still produces false positives and false negatives. Although these do exist, the probability of either occurring is quite small, as the filter is quite accurate, and when they do occur, the predictions for later mail are updated to reflect them. These two articles emphasize that Bayes’ theorem has many lesser-known applications and is often used to produce accurate estimates. When applied to spam filtering, given a piece of mail M, Bayes’ rule can be used to estimate whether M’s intent is fraudulent and to explain that decision, which helps justify why certain pieces of mail are put into spam and others are not.
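
To make the formula concrete, here is a minimal Python sketch of the calculation described above. The word statistics and the 50% prior are made-up numbers, not taken from the article, and updating the belief one word at a time assumes the words are independent of each other given the class (the usual "naive Bayes" simplification, which is my assumption rather than something the article states).

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' rule."""
    p_ham = 1.0 - p_spam
    # P(word) by the law of total probability over spam and non-spam mail.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


def classify(words, word_stats, p_spam=0.5):
    """Fold the evidence from each known word into one running belief.

    Assumes words are conditionally independent given spam / not spam.
    """
    belief = p_spam
    for word in words:
        if word in word_stats:
            p_w_spam, p_w_ham = word_stats[word]
            # Yesterday's posterior becomes today's prior.
            belief = posterior_spam(p_w_spam, p_w_ham, belief)
    return belief


# Hypothetical statistics: word -> (P(word | spam), P(word | not spam)).
WORD_STATS = {
    "winner": (0.40, 0.02),
    "free": (0.50, 0.10),
    "meeting": (0.05, 0.30),
}

print(classify(["free", "winner"], WORD_STATS))  # belief rises well above the prior
print(classify(["meeting"], WORD_STATS))         # belief falls below the prior
```

With these invented numbers, a single occurrence of "winner" moves the belief from 0.5 to about 0.95, which shows how one strongly spam-associated word can dominate the estimate.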

Information Cascades and Social Media

Today’s technology has made it far simpler to learn about current events and discover information that might not have been easily accessible years ago. Although this calls for celebration, issues arise when people believe everything that can be found on the internet. There is an array of news sites, blogs, and social media sites that provide users with this information, but no way of fact-checking it or verifying the source. Interestingly, a study done earlier this year looked into the rate at which false and true news spread over Twitter. The researchers found that “False news reached more people than the truth”. They suggested that false stories inspire fear or surprise, while true stories inspire anticipation and trust.

This study is a good example of how information cascades are constantly being formed. One can suspect that false news spreads because users share information based on others’ responses rather than on their own judgment. When a story is trending, people often do not hesitate to share it, on the assumption that if the masses are sharing it, it must be true. This misconception has become a prominent issue, and we will continue to see it in political, financial, and social contexts.

Finding the most influential movies using PageRank and other network analysis algorithms

Researchers at the University of Turin used network analysis algorithms to determine the most influential movies. Instead of looking at box office numbers (which are not very good at indicating how influential a movie will be in the future), the researchers looked at references between movies as a measure of success, and they used those findings to also determine the most influential actors, actresses, etc.

The researchers treated the movies as nodes and the references from one movie to another as the connections between them, and they also took into account the influence of the movies a given movie is connected to. They used four centrality scores (in-degree, closeness, harmonic centrality, and PageRank) to assign an influence score to each movie. They then extended this analysis to the directors of these movies as well as the actors and actresses in them. The top 10 most influential movies are as follows:

1. The Wizard of Oz (1939)

2. Star Wars (1977)

3. Psycho (1960)

4. King Kong (1933)

5. 2001: A Space Odyssey (1968)

6. Metropolis (1927)

7. Citizen Kane (1941)

8. The Birth of a Nation (1915)

9. Frankenstein (1931)

10. Snow White and the Seven Dwarfs (1937)
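
As a rough illustration of two of the centrality scores mentioned above, here is a small Python sketch that computes in-degree and PageRank on a reference graph. The four films and their references are entirely hypothetical, and the damping factor of 0.85 and the fixed iteration count are common textbook defaults that I am assuming, not settings reported by the researchers.

```python
def in_degree(graph):
    """Count how many films reference each film."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def pagerank(graph, damping=0.85, iterations=100):
    """Approximate PageRank by repeatedly redistributing scores along references."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every film gets a small baseline share each round.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A film that references nothing spreads its score evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical reference graph: each film maps to the films it references.
REFERENCES = {
    "Film A": [],
    "Film B": ["Film A"],
    "Film C": ["Film A", "Film B"],
    "Film D": ["Film A", "Film C"],
}

print(in_degree(REFERENCES))
ranks = pagerank(REFERENCES)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In-degree only counts references, while PageRank also weighs who is doing the referencing, which is the "influence of the movies a movie is connected to" idea from the study.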

Their research also showed that Japanese movies filmed during the 1950s have been very influential for Western cinema. It also revealed a gender gap: men dominated the lists of most influential people, and actresses generally did not appear unless the dataset was separated by gender. The exception is Sweden, where actresses outrank actors in the global rankings.

Baseball 3-0 Counts

In baseball, the count is the number of balls and strikes against the batter, written with balls first. The count can affect many different aspects of the game. When the count is 0-2, for example, the batter must be very cautious and less selective about what they swing at, because one more strike means they are out. On the other hand, when the count is 3-0, the power is in the hands of the batter, who can be as selective as they want, at least for a pitch or two, because they are not at immediate risk of striking out. Because so many different strategies go into choosing what pitch to throw in a given situation, baseball involves many concepts from game theory.

A 3-0 count raises many game-theoretic questions and many different opinions about what both the pitcher and the batter should do. Many believe that the batter should never swing at a 3-0 pitch and should instead challenge the pitcher to throw a strike. Others believe that the batter should be looking to swing, since the pitcher is likely to throw a very attractive, hittable pitch that the batter may not see again. Likewise, some believe that the pitcher should focus on throwing a very hittable pitch because the batter is unlikely to swing, whereas others believe the pitcher should throw something less hittable, such as a curveball or change-up, because the hitter will be looking to swing.
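
One way to see why neither side should be fully predictable is to treat the 3-0 pitch as a small two-player zero-sum game. The Python sketch below solves for the mixed-strategy equilibrium of a 2x2 game. The payoff numbers are invented purely for illustration (they are not from the article or from real run-value data), and reducing the decision to "swing or take" versus "hittable or tough pitch" is my own simplification.

```python
from fractions import Fraction


def solve_2x2_zero_sum(a, b, c, d):
    """Mixed equilibrium of a 2x2 zero-sum game without a pure-strategy solution.

    Payoffs to the row player (the batter):
                    column 1   column 2
        row 1          a          b
        row 2          c          d
    Returns (p, q, value): p is the probability of row 1, q of column 1.
    """
    denom = a - b - c + d
    if denom == 0:
        raise ValueError("degenerate game: no unique mixed equilibrium")
    p = (d - c) / denom  # makes the pitcher indifferent between columns
    q = (d - b) / denom  # makes the batter indifferent between rows
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("no interior mixed equilibrium; check for a pure one")
    value = (a * d - b * c) / denom
    return p, q, value


# Hypothetical payoffs to the batter (rows: swing, take; columns: hittable, tough).
SWING_HITTABLE = Fraction(6, 10)   # batter drives a good pitch
SWING_TOUGH = Fraction(-3, 10)     # batter chases a bad pitch
TAKE_HITTABLE = Fraction(-1, 10)   # called strike, count goes to 3-1
TAKE_TOUGH = Fraction(4, 10)       # ball four, batter walks

p_swing, q_hittable, value = solve_2x2_zero_sum(
    SWING_HITTABLE, SWING_TOUGH, TAKE_HITTABLE, TAKE_TOUGH
)
print(p_swing, q_hittable, value)
```

With these made-up payoffs the batter swings 5/14 of the time and the pitcher throws the hittable pitch half the time. The specific numbers mean nothing on their own; the point is that each player mixes so that the opponent gains nothing by guessing.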

In this article, across 1,012 opportunities with a 3-0 count in the 2014 season, players swung at 3-0 pitches an average of 8.99% of the time, with a median of 5.76%. While some conclusions can be drawn from this data, game theory cannot explain every occurrence in any situation. Many factors go into deciding what pitch the pitcher should throw, such as how likely the batter is to swing, how skilled the hitter is, how many outs there are in the inning, and how many runners are on base. Each of these scenarios can change which pitch is thrown and whether the batter swings at it. Overall, game theory will continue to be a large part of baseball.

Google suggests that combining pages could make your site rank better

In this article, John Mueller, Senior Webmaster Trends Analyst at Google, suggests that you should combine weaker, smaller web pages into a single page in order to improve your site’s rank in Google’s algorithm. His explanation is that if you have one page with more information, rather than that information being spread out among different pages on the site, then there will most likely be many links on your site all pointing to that one page. Since that page would have more links pointing to it (even though these links come from within your own website), it would have more authority than the smaller pages would, and would therefore be ranked higher.

This article relates very well to the work we did in class on hubs and authorities, PageRank, and the overall ranking of web pages by a search engine. We learned that, by the Authority Update Rule, the authority score of a page is equal to the sum of the hub scores of all the pages that point to it. It is therefore intuitive that if a page has more pages pointing to it, it will have a higher authority score and will therefore score better in Google’s algorithm.
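
To see the update rules from class in action, here is a short Python sketch of the hubs-and-authorities procedure run on two tiny, hypothetical versions of the same site: one where the content is split across three small pages and one where it is merged into a single page. The page names and link structure are invented, and this is the textbook procedure, not a description of how Google actually ranks pages.

```python
def hits(links, iterations=50):
    """Run the hub and authority update rules, normalizing after each round."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that link in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize so that each set of scores sums to 1.
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {page: score / auth_total for page, score in auth.items()}
        hub = {page: score / hub_total for page, score in hub.items()}
    return hub, auth


# Hypothetical site with the content split over three small pages.
SPLIT_SITE = {
    "home": ["tips-1", "tips-2", "tips-3"],
    "blog": ["tips-1"],
    "about": ["tips-2"],
}
# The same site after merging the three pages into one.
MERGED_SITE = {
    "home": ["tips"],
    "blog": ["tips"],
    "about": ["tips"],
}

_, split_auth = hits(SPLIT_SITE)
_, merged_auth = hits(MERGED_SITE)
print(max(split_auth.values()), merged_auth["tips"])
```

In the merged version every internal link points at the same page, so that page collects all of the authority instead of sharing it with its siblings.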

How do trends start?

We are constantly changing how we view the world and picking up new trends as the years go by. What was popular a few years ago might be criticized today because of how different the world is now. But how do all of these trends begin? Many opinions and ideas can be found on the internet, but only a few seem to become popular enough that the whole world hears or reads about them. Is there a common trait among all trends that would help us understand how and why they began?

According to the source, for a trend to begin, it must be accepted by distinct groups of people, which include celebrities, artists, young people, designers, wealthy people, and gay men. Once it is accepted by these groups, also known as trendsetters, it must then be picked up by larger groups of people through social media, magazines, the internet, etc. Depending on what the trend is, it typically lasts a few years before people’s tastes or styles change.

In class, we learned about network diffusion and how even one node can make a difference in a graph. Here, that node can represent a trendsetter who can potentially get a large percentage of the world’s population to either adopt the new idea or at least hear about it.

A defective auditing market that makes ‘lemons’ of us all

This article covers the market for audit services and how it exhibits the “lemon” problem. A report from the International Forum of Independent Audit Regulators showed evidence of serious problems in 40% of the audits inspected that year. In most markets, if 40% of the goods sampled are defective, the market is usually not functioning well. The audit market, however, is different because the customers (companies) must be audited, so they are obligated to buy the product. There is also an intermediary, separate from the company itself, who chooses the product for the customer.

The article notes that there is little incentive to compete on audit quality, because a higher-quality audit could expose vulnerabilities in a company. Big auditing companies such as KPMG and Deloitte have evaded liability by creating generic assessments of their clients’ financial standings. This is unlike the lemon model we learned in class because the client companies (the buyers) of auditing services actually want the “lemon” if it benefits the company. This allows the “lemon” auditing companies to be very successful, but breaking these large audit firms up will not fix the lemon problem. Buyers would have to demand higher standards for audits and be ready to pay for them. Auditors would then have to decide what their audits should really show about a company. In contrast to what we learned in class, sellers want to keep selling “lemons” to buyers, and buyers treat these lemons as the higher-quality goods even though they are in fact lower-quality goods.

Interacting Agents and Stock Market Crashes

Traditional economics assumes individual actors who each act to maximize their own utility and who collectively produce aggregate trends. Actual aggregate behavior, however, depends heavily on the interactions between individuals, producing aggregate trends that do not simply reflect single-actor utility maximization.

These behaviors can be analyzed to show the difference between “weak” and “strong” neighborhood interactions. In both situations, agents buying and selling a single asset interact with an average opinion of the market (reflecting public opinion). Opinion of the asset starts high at the beginning of the model, and the value of the asset is then perturbed slightly downward. Weak neighborhood interactions (where each agent acts in their own interest) create a smooth downward curve, as each agent sells when the value of the asset drops below their threshold and the thresholds are evenly distributed. This smooth behavior is rare in the actual stock market, where neighborhood interactions are strong.

Strong neighborhood interactions create very sudden, non-smooth drops that are similar to how stocks crash. When fitted to actual stock market data, the strong neighborhood interaction model shows a good fit.
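
The contrast between the two regimes can be sketched with a toy threshold model in Python. This is not the model from the article: the `coupling` parameter, the way selling pressure is subtracted from the asset value, and the specific numbers are all my own assumptions, chosen only to show how the same evenly spread thresholds can give either a smooth decline or a sudden crash.

```python
def simulate(coupling, n_agents=1000, steps=20):
    """Return the fraction of agents still holding after each downward step."""
    # Evenly spread selling thresholds strictly between 0 and 1.
    thresholds = [(i + 0.5) / n_agents for i in range(n_agents)]
    holding = [True] * n_agents
    history = []
    for step in range(1, steps + 1):
        value = 1.0 - step / steps  # the asset value drifts downward
        changed = True
        while changed:  # let sales trigger further sales until things settle
            changed = False
            sold_fraction = holding.count(False) / n_agents
            # Perceived value = actual value minus pressure from those who sold.
            perceived = value - coupling * sold_fraction
            for i in range(n_agents):
                if holding[i] and perceived < thresholds[i]:
                    holding[i] = False
                    changed = True
        history.append(holding.count(True) / n_agents)
    return history


def largest_drop(history):
    """Largest one-step fall in the holding fraction, starting from 1.0."""
    previous, biggest = 1.0, 0.0
    for fraction in history:
        biggest = max(biggest, previous - fraction)
        previous = fraction
    return biggest


weak = simulate(coupling=0.0)    # agents react only to the asset value
strong = simulate(coupling=1.5)  # agents also react strongly to others selling
print(largest_drop(weak), largest_drop(strong))
```

With no coupling, the holding fraction falls by the same small amount at every step. With strong coupling, the first few sales lower everyone else's perceived value enough to trigger more sales, and in this toy setup the whole market sells out in a single step.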


Degrees of Separation and Business


This article discusses how kindness can positively impact a business, specifically how it should be at the foundation of the business’s overall brand. The idea is that when kindness is practiced in all aspects of the business, it creates an atmosphere where workers put in their best effort and customers feel happy to give the company their business. The article then discusses the reach these efforts can have, using the principle of six degrees of separation. It connects this principle to the idea that we are more likely to do things for other people so long as they do things for us. This means that spreading kindness can create a web of people who view the business positively. The article ends with five key takeaways: exceed expectations, do pro bono work, make referrals, treat your employees well, and donate a portion of profits to charity.


The way the principle of six degrees of separation is applied in this article is very interesting. The application is that businesses that focus on kindness will grow their network of people. This makes sense, as a web of people can form through word of mouth or similar methods of spreading information. Since people in society are fairly interconnected, the article suggests that the best connections are those rooted in kindness. Therefore, by building this into its setting and structure, a business can reap positive rewards.

Why Third Parties Can’t Rise in the U.S.

We’ve been talking in class about how there is no perfect voting system that converts individual preferences into a community-wide ranking. Naturally, one might start to think about the voting system in the United States and how it affects our representation and preferences, specifically how the system we have in place right now does not allow for the rise of third parties. We’re at a time in our political history when people want more choices for whom they elect to office (61% of the American population, according to Gallup in 2013). Even in this most recent election season, candidates Gary Johnson of the Libertarian Party and Jill Stein of the Green Party failed to gain much traction, despite third-party candidates being more prominent in this election than in 2014.


One major reason is that voters feel that voting for a third party throws away their vote. For example, if you’re socially liberal but fiscally conservative, you might vote for a Libertarian candidate. But if you would prefer a Democratic candidate over a Republican one, then you run the risk of helping the candidate you like least win. In other words, you’ll be harming your second-preference candidate because you’re only able to vote for one person. And this isn’t just a hypothetical. In 2000, Al Gore lost Florida to George W. Bush by only 537 votes. Many argued that had Ralph Nader not run on the Green Party ticket, Al Gore would have picked up Florida and hence the presidency. Such a situation is bound to recur under the current voting system, since even if you would rather have a third-party candidate than a Republican or a Democrat, voting for that candidate means not contributing a vote to your second-preferred candidate, who could actually win. In the 1992 presidential election, Clinton received 44 million votes to George H. W. Bush’s 39 million and Ross Perot’s nearly 20 million. Who knows whether George H. W. Bush would have won had Perot not run? There’s no way to tell, since our voting system doesn’t reflect preferences other than your first choice.
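
The "spoiler" effect described above is easy to reproduce in a few lines of Python. The electorate below is completely made up (the candidate labels and vote counts are hypothetical and not meant to represent any real election); it just shows how counting only first choices can hand the win to a candidate that a majority ranks last.

```python
from collections import Counter


def plurality_winner(ballots, running):
    """Count only each voter's top choice among the candidates who are running."""
    tally = Counter()
    for ranking, voters in ballots:
        top_choice = next(c for c in ranking if c in running)
        tally[top_choice] += voters
    return tally.most_common(1)[0][0], dict(tally)


# Hypothetical electorate: (ranking from most to least preferred, number of voters).
BALLOTS = [
    (("Major A", "Major B", "Third"), 49),
    (("Major B", "Third", "Major A"), 46),
    (("Third", "Major B", "Major A"), 5),
]

with_third = plurality_winner(BALLOTS, {"Major A", "Major B", "Third"})
without_third = plurality_winner(BALLOTS, {"Major A", "Major B"})
print(with_third)
print(without_third)
```

With the third party on the ballot, Major A wins 49 to 46 to 5. Without it, the five third-party voters fall back to their second choice and Major B wins 51 to 49, even though nobody's preferences changed.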

Another reason is the way voting has been modernized. Minority parties were able to rise to prominence in the 19th century, most notably the Republican Party with the election of Abraham Lincoln. But back then, there weren’t the legal barriers that exist today. Candidates didn’t need to petition to be on a ballot (Evan McMullin, who ran as an independent in the 2016 election, was only able to get on the ballot in 11 states because of time constraints), and parties could even nominate candidates already nominated by another party. This is important because it lends legitimacy to a third party, since it nominates candidates of more established parties while also promoting its own to some extent.

