TextRank: A Graph-Based NLP Algorithm

In class, we learned about PageRank, a graph-based ranking algorithm used to determine the importance of webpages on the internet. Using a similar idea, researchers at the University of North Texas (UNT) created TextRank, another recursive graph-based ranking algorithm, which extracts important sentences or keywords from a body of text. As you can imagine, this is very useful: sentence and keyword extraction can be used to summarize, classify, or index a passage of text automatically. More creative use cases might include building visual summaries or timelines from bodies of text.

TextRank is an effective and surprisingly accurate algorithm, and although it was first introduced in 2004 by UNT researchers Rada Mihalcea and Paul Tarau, it is still very relevant today. TextRank is unsupervised and requires no training, so unlike many other natural language processing methods that involve machine learning, it is very easy to use. Mihalcea and Tarau show how TextRank can be used for (1) keyword extraction and (2) sentence extraction, so we will look at each of these cases separately to understand how the algorithm works.

Keyword Extraction

In keyword extraction with TextRank, the goal is to first rank each word in a passage by importance and then return some number of the highest-ranked words. The first step is to build a graph. Mihalcea and Tarau let nouns and adjectives be the vertices, noting that verbs, prepositions, and other parts of speech are usually not good keyword candidates. The vertices are then connected by edges based on co-occurrence: if two words appear within a window of N words of each other in the passage (the paper tests windows of 2 to 10 words), they are connected in the graph. Once the graph is constructed, the same recursive score-assignment process used in PageRank is applied to the TextRank graph until each word's score converges. After convergence, every word in the graph has a score, and the top-scoring words become the keywords.
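To make the steps above concrete, here is a minimal pure-Python sketch of keyword extraction. It is not the authors' implementation: where the paper keeps only nouns and adjectives via part-of-speech tagging, this sketch approximates that with a small hand-picked stopword list, and it uses unweighted co-occurrence edges.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, d=0.85, iters=50, top_n=3):
    """Rank words with a TextRank-style co-occurrence graph.

    Assumption: a toy stopword filter stands in for the paper's
    part-of-speech filtering (nouns and adjectives only).
    """
    stopwords = {"the", "a", "an", "of", "to", "and", "in", "is",
                 "are", "on", "for", "with", "as", "by", "this", "that"}
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in stopwords]

    # Connect words that co-occur within the window (symmetric edges).
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # PageRank-style recursive scoring until (approximate) convergence.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                      for u in neighbors[w])
                 for w in neighbors}

    # Words with no co-occurrence edges are dropped; return top scorers.
    return sorted(score, key=score.get, reverse=True)[:top_n]
```

Note how no training happens anywhere: the ranking falls out of the graph structure alone, which is exactly why TextRank counts as unsupervised.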

When TextRank was applied to a short sample passage in the paper, it extracted keywords close to the ones human annotators manually chose.

Sentence Extraction

Using TextRank for sentence extraction is very similar to using it for keyword extraction, with a few minor differences. The first difference is what one might expect: instead of using words as vertices, sentences are used. The second difference is not as obvious, although it does make a lot of sense. To connect vertices, edges are created using similarity scores between sentences rather than co-occurrence scores. If one sentence has many of the same words as another sentence, the two sentences will be connected by an edge. The more words in common between sentences, the more strongly they will be connected (their edge weights will be higher). After the graph is constructed, the scores of the graph’s vertices are recursively updated until convergence. The sentences with the top scores can be used to summarize the original passage.
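The sentence-extraction variant can be sketched the same way. The similarity normalization below, word overlap divided by the sum of the logs of the sentence lengths, follows the paper; the rest (tokenization, iteration counts) is a simplifying assumption, not the authors' code.

```python
import math
import re
from collections import defaultdict

def textrank_sentences(sentences, d=0.85, iters=50, top_n=2):
    """Rank sentences with TextRank using overlap-based similarity.

    Edge weights follow the paper's normalization:
    overlap / (log|Si| + log|Sj|).
    """
    tokens = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)

    # Weighted, symmetric edges from pairwise word overlap.
    weight = defaultdict(float)
    for i in range(n):
        for j in range(i + 1, n):
            overlap = len(tokens[i] & tokens[j])
            if overlap and len(tokens[i]) > 1 and len(tokens[j]) > 1:
                sim = overlap / (math.log(len(tokens[i])) +
                                 math.log(len(tokens[j])))
                weight[(i, j)] = weight[(j, i)] = sim

    # Total outgoing weight per vertex, for the weighted update rule.
    out = defaultdict(float)
    for (i, _), w in weight.items():
        out[i] += w

    # Weighted PageRank-style iteration until scores settle.
    score = [1.0] * n
    for _ in range(iters):
        score = [(1 - d) + d * sum(weight[(j, i)] * score[j] / out[j]
                                   for j in range(n) if weight[(j, i)])
                 for i in range(n)]

    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]
```

A sentence that shares words with many other sentences accumulates weight from all of them, so the top-ranked sentences tend to be the ones most representative of the passage.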

When TextRank was applied to a collection of articles on Hurricane Gilbert, it was able to create a summary of the articles very similar to what humans created after reading the same articles.

TextRank has been a great tool for natural language processing thus far: an NLP algorithm that extracts keywords and important sentences by considering the text as a whole rather than just local context. Because of the simplicity and intuition behind TextRank, I think it will be around for a while, not just as a baseline for keyword and sentence extraction, but also as a solid foundation for other algorithms to build on and improve.

References: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf (Mihalcea and Tarau's original TextRank paper, EMNLP 2004).
