


Using PageRank to detect phishing websites

Phishing is the act of attempting to acquire protected or personal information, such as usernames, passwords and credit card details, by masquerading as a trustworthy entity. One report estimates that, at an absolute minimum, $320 million is lost annually to phishing scams.[1]

A recent paper [2] proposes a PageRank-based technique for detecting phishing. Specifically, the authors use Google’s PageRank value as a heuristic to evaluate whether a site is a phishing site, since a site’s PageRank value is robust and updated frequently. The paper also discusses an implementation that classifies phishing sites accurately by taking into account features other than the PageRank value, such as the age of the domain, the suspiciousness of the URL, whether the domain contains an IP address, and whether the site takes personal information from the user as input.
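To make these features concrete, here is a minimal Python sketch of how they might be gathered for one site. It is not the paper's implementation: the function names, the "suspicious URL" rule, and the idea that the PageRank value, domain age, and input-form flag are supplied by the caller are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip_address(url: str) -> bool:
    """Return True when the URL's host is a literal IPv4 or IPv6 address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_looks_suspicious(url: str) -> bool:
    """Illustrative rule only (an assumption): an '@' in the URL or many dots in the host."""
    host = urlparse(url).hostname or ""
    return "@" in url or host.count(".") >= 4


def extract_features(url, pagerank, domain_age_days, asks_for_personal_info):
    """Collect the features named in the post into one dictionary.

    pagerank, domain_age_days and asks_for_personal_info are assumed to come
    from elsewhere (a rank lookup, a WHOIS query, a scan of the page's forms).
    """
    return {
        "pagerank": pagerank,
        "domain_age_days": domain_age_days,
        "suspicious_url": url_looks_suspicious(url),
        "host_is_ip": host_is_ip_address(url),
        "asks_for_personal_info": asks_for_personal_info,
    }
```

For example, `extract_features("http://192.168.0.1/login", 0, 5, True)` would flag the host as an IP address, while a URL such as `https://www.example.com/` would not.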

In the second section of the paper, the authors classify existing solutions to phishing into five main categories:

  1. Blacklisting: a URL is compared against a “blacklist” of URLs that have been reported as phishing sites.
  2. Heuristics: patterns in the URL that correspond to suspicious sites are identified. This performs better than blacklisting.
  3. Machine learning: highly effective at identifying phishing sites, but requires large white-lists to prevent false positives.
  4. Trusted communication: a process is used to identify legitimate sites and create a white-list. However, this method can only identify a small fraction of the total number of legitimate sites.
  5. Hybrid: several of the approaches listed above are combined to identify phishing sites.
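As a rough illustration of how a hybrid check might chain these categories, the Python sketch below consults a blacklist first, then a white-list, and only then falls back to a heuristic. The list contents, the function names, and the verdict labels are hypothetical and are not taken from the paper.

```python
from urllib.parse import urlparse

# Hypothetical example data; real lists would come from reporting services.
BLACKLIST = {"evil.example"}
WHITELIST = {"bank.example"}


def normalized_host(url: str) -> str:
    """Lower-cased host name with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def hybrid_verdict(url: str, heuristic) -> str:
    """Blacklist first, then white-list, then a caller-supplied heuristic."""
    host = normalized_host(url)
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    return "suspicious" if heuristic(url) else "unknown"
```

The ordering is a design choice: list lookups are cheap and decisive, so the heuristic only runs on the sites that neither list knows about.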

The authors developed a content-based approach to detect phishing called CANTINA, which considers a page’s Google PageRank value (also called the Google Toolbar Rank, GTR) as an important heuristic. The algorithm used to calculate a site’s GTR is similar to the one discussed in class, with one addition: a blacklisted page receives a GTR of -1. GTR evaluation is a vital element in the detection of phishing sites because a legitimate site tends to have a high PageRank value, while a phishing site tends to have a low one.
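The sketch below shows one way the GTR could be combined with the other features into a simple score. Only the rule that a blacklisted page has a GTR of -1 comes from the description above; the threshold values, the weights, and all of the names are assumptions chosen for illustration, not the authors' classifier.

```python
from dataclasses import dataclass

BLACKLISTED_GTR = -1     # from the post: a blacklisted page receives a GTR of -1
LOW_RANK_THRESHOLD = 3   # assumption: ranks below this count as "low"
YOUNG_DOMAIN_DAYS = 90   # assumption: domains younger than this look risky
MAX_SCORE = 5
PHISHING_CUTOFF = 3      # assumption: scores at or above this are flagged


@dataclass
class SiteSignals:
    gtr: int
    domain_age_days: int
    host_is_ip: bool
    asks_for_personal_info: bool


def phishing_score(s: SiteSignals) -> int:
    """Add up integer points; a low GTR carries the most weight."""
    if s.gtr == BLACKLISTED_GTR:
        return MAX_SCORE
    score = 0
    if s.gtr < LOW_RANK_THRESHOLD:
        score += 2
    if s.domain_age_days < YOUNG_DOMAIN_DAYS:
        score += 1
    if s.host_is_ip:
        score += 1
    if s.asks_for_personal_info:
        score += 1
    return score


def is_phishing(s: SiteSignals) -> bool:
    return phishing_score(s) >= PHISHING_CUTOFF
```

Under these assumed weights, an old, highly ranked site that asks for a password scores 1 and is not flagged, while a ten-day-old site with a GTR of 0 served from a bare IP address scores 5 and is.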

In their conclusion, the authors report that false positives could be minimized and true positives maximized by including the PageRank of a site as a heuristic.

1. http://people.seas.harvard.edu/~tmoore/infosec-phishing.pdf
2. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6222667

 

-vtn6
