Automatic Summarization Using PageRank

The premise of the PageRank algorithm is “prestige”. Given a set of nodes and the relationships between these nodes, PageRank provides us with a means of identifying which amongst these nodes is the most prestigious. Over time this notion of “prestige” has been adapted for many purposes, and along with that, been given many different names — “influence”, “importance”, etc.

Our interest today is in a rather niche application of PageRank, one with significant consequences. We’re interested in how PageRank can be adapted to summarize text documents. That is, given a document of n sentences, how does one come up with an automated procedure for generating a summary of m sentences, where m < n? Naturally, these m sentences must be the most “important”.

Where is this used in the wild, you say? A few weeks ago, Flipboard published an article describing the use of auto-summarization in the generation of dynamic layouts. To put things in context, Flipboard curates articles from across the web and presents them to users in a visually appealing manner. An excerpt that accompanies the article isn’t too unreasonable a demand, but it’s a fool’s errand to think that any organization could (or should) hire copywriters to sit at computers all day long summarizing articles that flow through the system.

Let’s also take this from another perspective. As wearable devices leap to prominence (Google Glass and the Apple Watch, to name a few), content must adapt to the limited screen space available on these devices. This blog post, when viewed verbatim on a two- or three-inch screen on your wrist, becomes agonizing to read.

So let’s put the cart before the horse. Take a look at this paragraph:

Alaska Airlines is planning to install a unique version of the system that excludes internet access, so passengers can still stream content on routes without coverage. If your plane is connected to the web, you’ll also have access to streaming television content. Like Gogo, Global Eagle can stream content to customers on planes that aren’t connected to the web. That satellite system could also enable the company to offer live TV, like the Global Eagle service you’ll find on Southwest’s planes, but Gogo’s next-generation infrastructure is still a few years out.

This is really an automated summary of an Engadget article found here. Is it perfect? No. Reading a little carefully, one easily finds oddities a human wouldn’t include in a summary. However, it’s not so bad that it’s unusable, and more importantly, although the sentences don’t appear to link to one another, the information encoded in the summary is still very much valid.

The technology behind auto-summarization isn’t that new. The algorithm Flipboard describes in their article was published back in 2004, under the name LexRank. Let’s start with the observation that a document is really just a sequence of sentences. We can represent each of these sentences as a node. Now, in order for the concept of PageRank to work, we need some way to represent the relationship between these sentences. The naive approach is to measure the difference in characters between two sentences (e.g. the Levenshtein distance), but that does a poor job of capturing the relationship between their semantics. Instead, we can use the Jaccard index, which measures similarity over the sets of words in the two sentences.
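As a minimal sketch of that similarity measure (assuming naive lowercase whitespace tokenization; a real implementation would also strip punctuation and perhaps stem words):

```python
def jaccard(sentence_a: str, sentence_b: str) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B| over the sets of words."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Shared words {the, cat, on} out of 7 distinct words overall:
jaccard("the cat sat on the mat", "the cat lay on the rug")  # ≈ 0.4286
```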

To create the network, all we have to do is set a threshold for this similarity, above which we accept that any two sentences are “similar”. An edge exists between every pair of similar sentences. If we run PageRank on this network (the explanation of which we will omit, since it has been explained to death on this blog) and choose the m top-ranked sentences, we have our summary.
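The whole pipeline fits in a few dozen lines. This is my own sketch, not the LexRank reference implementation: the threshold, damping factor, and iteration count are illustrative choices, and the tokenization is the same naive word split as above.

```python
def summarize(sentences, m, threshold=0.1, damping=0.85, iters=50):
    """Pick the m top-ranked sentences, returned in document order."""
    n = len(sentences)
    words = [set(s.lower().split()) for s in sentences]

    # Build an undirected edge wherever Jaccard similarity clears the threshold.
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = words[i] | words[j]
            sim = len(words[i] & words[j]) / len(union) if union else 0.0
            if sim > threshold:
                adj[i][j] = adj[j][i] = 1.0
    degree = [sum(row) for row in adj]

    # Plain power-iteration PageRank; each sentence splits its rank
    # evenly among its neighbours ("votes" for them) on every pass.
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n + damping * sum(
                rank[j] * adj[j][i] / degree[j]
                for j in range(n) if degree[j] > 0
            )
            for i in range(n)
        ]

    # Top-m sentences by rank, re-emitted in their original order.
    top = sorted(sorted(range(n), key=rank.__getitem__, reverse=True)[:m])
    return [sentences[i] for i in top]
```

A sentence that many others resemble accumulates rank; an isolated one keeps only the baseline (1 − damping)/n and falls out of the summary.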

The beauty of this approach is that it requires no additional information about the article in order to perform summarization. While advanced NLP algorithms can attempt to infer meaning and structure from the sentences in the article, all this algorithm uses is the understanding that sentences can “vote” for each other, and if I repeat myself twenty times over in twenty similar ways in a paragraph, the one sentence in that paragraph which isn’t a repeat is going to be important.

There’s a problem which the original algorithm doesn’t account for. The default PageRank algorithm suffers from a flaw, in which a set of interconnected terminal nodes ends up with a disproportionately high ranking. In the context of text summarization, if the threshold is increased (in order to extract only the most relevant sentences), the difference in rank between, say, the top two entries and the next three entries in a five-sentence summary could be very large. Depending on how information is distributed throughout the document, this could become undesirable. The solution is to apply the scaled PageRank algorithm instead. By redistributing a fraction of the rank over all the sentences after every PageRank update, one could possibly create a more comprehensive summary.

Another improvement can be made by applying PageRank adaptively. To a large extent, properly authored articles exhibit a certain structure: there’s some kind of introduction, the main content, and finally a conclusion. The standard approach of LexRank is to consider each sentence independently of its location in the document, but if PageRank is run locally (say, scoped to a paragraph), and the best sentences chosen from within each paragraph are then ranked again globally, we could create a summary that accounts for the structure of the document. This is also closer to how humans create summaries.
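One way the two-stage idea could look (again my own interpretation, not the article’s code): here a simple degree-style centrality over Jaccard similarity stands in for a full local PageRank run, to keep the sketch short.

```python
def centrality(sentences):
    """Total Jaccard similarity of each sentence to all the others."""
    words = [set(s.lower().split()) for s in sentences]
    def sim(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return [sum(sim(words[i], words[j])
                for j in range(len(words)) if j != i)
            for i in range(len(words))]

def structured_summary(paragraphs, m):
    # Stage 1: the most central sentence wins each paragraph.
    winners = [para[max(range(len(para)),
                        key=centrality(para).__getitem__)]
               for para in paragraphs]
    # Stage 2: rank the winners globally and keep the top m, in order.
    scores = centrality(winners)
    top = sorted(sorted(range(len(winners)), key=scores.__getitem__,
                        reverse=True)[:m])
    return [winners[i] for i in top]
```

Because stage 1 guarantees at most one sentence per paragraph, a repetitive paragraph can no longer crowd the whole summary, which is the structural benefit the paragraph above describes.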

And lastly, if you’d like to play with the algorithm, a friend and I decided to implement it as a weekend project. It’s hosted on a free server, and because of that it’s not the fastest service on the planet, but have at it anyway!
