Using PageRank to Detect Fraud in Healthcare
Link: http://hortonworks.com/blog/using-pagerank-detect-anomalies-fraud-healthcare/
PageRank is best known as an integral component in the success of Google’s search engine. However, it is also widely used in other fields with varying use cases. PageRank is not limited to web pages; in the general case, it is used to find “important” or “relevant” nodes in a graph. Seen this way, PageRank can be applied to a diverse set of problems that have nothing to do with web search.
One example appears in the Hortonworks blog post linked above, which covers how they used the personalized PageRank algorithm to detect healthcare fraud. What follows is a summary of their approach.
Hortonworks used a simple idea for fraud detection: if a certain doctor has a high PageRank in a field that they are not tagged in, there is a higher chance of fraud. An example of this would be an internal medicine doctor having a high personalized PageRank in the plastic surgery category.
To execute this idea, they needed a notion of connection between doctors, a way to tag doctors, and a separate PageRank computed for each tag. They tagged doctors with specialties taken from the dataset provided, then used these tags to derive a similarity between doctors: doctors with similar tags were connected. From there they implemented the personalized PageRank.
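The graph-building step can be sketched as follows. This is only an illustration, not the Hortonworks implementation: it assumes that "similar tags" means two doctors share at least one tag, and the function name `build_similarity_graph` and the sample doctors are hypothetical.

```python
from itertools import combinations

def build_similarity_graph(doctor_tags):
    """Return an undirected adjacency map linking doctors that share a tag.

    Assumption: sharing at least one tag counts as "similar". The blog
    may use a different similarity measure.
    """
    graph = {doctor: set() for doctor in doctor_tags}
    for a, b in combinations(doctor_tags, 2):
        if doctor_tags[a] & doctor_tags[b]:  # non-empty tag overlap
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical sample data: doctor id -> set of specialty tags.
doctor_tags = {
    "dr_a": {"plastic surgery"},
    "dr_b": {"plastic surgery", "dermatology"},
    "dr_c": {"dermatology"},
    "dr_d": {"internal medicine"},
}
similarity_graph = build_similarity_graph(doctor_tags)
```

The pairwise loop is quadratic in the number of doctors, which is fine for a small illustration; a large dataset would more likely group doctors by tag first.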
In standard PageRank, all nodes start off with equal value. This is not the case with the personalized PageRank algorithm. In this graph, there are clusters of doctors grouped by tag (example groups: dermatologists, psychologists, and plastic surgeons). If the goal is to compute a personalized PageRank for each group, then each group should be emphasized in its own PageRank computation. For example, when calculating the PageRank for plastic surgeons, all plastic surgeons should be emphasized. Thus, they added extra weight to the group for which the PageRank was being calculated. The final output contained N different PageRanks, one for each of the N tags. Afterwards, they analyzed each PageRank, searched for nodes with a high PageRank but a non-matching tag, and marked them as fraud.
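A minimal sketch of this computation for a single tag is shown below. It makes several assumptions that go beyond the description above: the "extra weight" is modeled in the common way, by sending all restart (teleport) probability to the doctors who carry the tag; the damping factor of 0.85 and the iteration count are conventional defaults; and the graph, doctor ids, and the choice to simply rank non-matching doctors are hypothetical.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration in which restarts land only on the seed nodes."""
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)  # start from the personalization distribution
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for m in neighbors:
                    new_rank[m] += share
            else:
                # Dangling node: hand its mass back to the seed group.
                for s in seeds:
                    new_rank[s] += damping * rank[n] / len(seeds)
        rank = new_rank
    return rank

# Hypothetical data: one tag per doctor and an undirected similarity graph.
tags = {
    "s1": "plastic surgery", "s2": "plastic surgery", "s3": "plastic surgery",
    "i1": "internal medicine", "i2": "internal medicine", "d1": "dermatology",
}
graph = {
    "s1": {"s2", "s3", "i1"},
    "s2": {"s1", "s3", "i1"},
    "s3": {"s1", "s2", "i1"},
    "i1": {"s1", "s2", "s3", "i2"},
    "i2": {"i1", "d1"},
    "d1": {"i2"},
}

seeds = {d for d, t in tags.items() if t == "plastic surgery"}
scores = personalized_pagerank(graph, seeds)

# Doctors without the tag, ordered from highest to lowest score.
outsiders = sorted((d for d in graph if d not in seeds),
                   key=scores.get, reverse=True)
```

Repeating this once per tag gives one score vector per tag. Doctors near the top of `outsiders` are the ones with a high score in a specialty they are not tagged with; where to draw the cutoff is left open here, since the description above does not specify one.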