


Analysis of Wikipedia to Measure Semantic Relatedness

Due to the organic way Wikipedia is continuously grown and modified by a vast number of contributors, the online encyclopedia is uniquely representative of human knowledge. The resource is valuable not only in its content, but also in its structure. Wikipedia is an enormous network. Each page is a node, and the links to and from each page are directed edges. There are inline links, which are stronger ties, as well as links at the bottom of each page, which are weaker ties. The network is highly complex: the average Wikipedia page has 34 links into the page and 34 links out of the page. Because each page represents a word or idea, this network has great potential to be used in machine learning and artificial intelligence.
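To make the network picture concrete, here is a minimal sketch of pages as nodes and links as directed edges. The page titles and links are a made-up toy graph, not real Wikipedia data, and the helper names `in_links` and `average_degree` are my own.

```python
# Toy directed graph: each key is a page, each value is the set of pages it links to.
# These titles and links are illustrative assumptions, not real Wikipedia data.
out_links = {
    "Automobile": {"Fossil fuel", "Air pollution"},
    "Global warming": {"Fossil fuel", "Carbon dioxide"},
    "Fossil fuel": {"Carbon dioxide"},
    "Carbon dioxide": set(),
    "Air pollution": {"Carbon dioxide"},
}


def in_links(out_links):
    """Invert the graph: map each page to the set of pages that link to it."""
    incoming = {page: set() for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming


def average_degree(adjacency):
    """Average number of neighbours per page in an adjacency mapping."""
    return sum(len(neighbours) for neighbours in adjacency.values()) / len(adjacency)


avg_out = average_degree(out_links)
avg_in = average_degree(in_links(out_links))
print(avg_out, avg_in)
```

The two averages always match when every link stays inside the same set of pages, because each edge counts once as an outgoing link and once as an incoming link. That is consistent with the identical in and out figures quoted above.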

One application of this is in semantic relatedness, which could be used in automated natural language processing and to help computers reason about language. In the paper linked below, the hyperlink structure of Wikipedia is used to create a measure of semantic relatedness. To measure the semantic relatedness between two terms, simply consider the number of links that the terms' pages have in common. For example, “automobile” and “global warming” share “fossil fuel”, “20th Century”, “carbon dioxide”, and “air pollution”. Not all articles are words or phrases, but terms can be found by looking at anchors. Anchors are the inline links that point from relevant terms or phrases to their pages. Because Wikipedia dictates that relevant terms must be linked, anchors are very useful in overcoming two main challenges of language processing and determining relatedness: polysemy and synonymy.
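The shared-link idea can be sketched in a few lines. This is only the simplified count described above, not necessarily the exact formula from the paper. The shared links for “automobile” and “global warming” come from the example above, while the remaining links and the function names are assumptions I made up for illustration.

```python
# Links associated with each term's page. The four shared links come from the
# example in the text; every other entry is a made-up placeholder.
links = {
    "automobile": {
        "fossil fuel", "20th Century", "carbon dioxide", "air pollution",
        "engine", "road",
    },
    "global warming": {
        "fossil fuel", "20th Century", "carbon dioxide", "air pollution",
        "glacier", "sea level",
    },
    "plane (tool)": {"woodworking", "carpentry"},
}


def shared_links(term_a, term_b, links):
    """Return the links that both terms have in common."""
    return links[term_a] & links[term_b]


def relatedness(term_a, term_b, links):
    """Naive relatedness score: the number of links the two terms share."""
    return len(shared_links(term_a, term_b, links))


print(sorted(shared_links("automobile", "global warming", links)))
print(relatedness("automobile", "global warming", links))  # 4
print(relatedness("automobile", "plane (tool)", links))    # 0
```

A raw count favours pages with many links, so a practical measure would probably normalize by the size of each link set. I have left that out to keep the sketch close to the description above.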

Polysemy is when a single term has multiple possible meanings. Consider the word “plane.” This could refer to an airplane, the mathematical surface defined by three non-collinear points, or a woodworking tool. These three concepts have very different relatedness to other words. For example, “flight” is closely related to airplanes, but not to the infinite surface or the tool. Synonymy occurs when multiple terms have the same meaning, for instance, plane, airplane, and aeroplane. Polysemy and synonymy are both easily addressed using the structure of Wikipedia links. In the case of polysemy, different meanings of a term have different pages. The anchors in a particular page serve as context for distinguishing which meaning of the word is intended. Additionally, the most common meaning of a term can be chosen by looking at the number of links to that meaning’s page. The more popular a usage is, the more links will point to that page. Synonyms are easily determined, since anchors with the same meaning all point to one page.
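Here is a rough sketch of how anchors could handle both problems. All of the counts, page titles, and link sets below are invented for illustration, and the function names are my own. The paper's actual procedure may differ.

```python
# Hypothetical data: how often each anchor text links to each target page.
anchor_counts = {
    ("plane", "Airplane"): 900,
    ("plane", "Plane (geometry)"): 250,
    ("plane", "Plane (tool)"): 40,
    ("airplane", "Airplane"): 1200,
    ("aeroplane", "Airplane"): 300,
}

# Hypothetical links found on each candidate page, used as context clues.
page_links = {
    "Airplane": {"flight", "wing", "airport"},
    "Plane (geometry)": {"point", "line", "surface"},
    "Plane (tool)": {"woodworking", "chisel", "wood"},
}


def senses(anchor):
    """Polysemy: every page this anchor text points to, with its link count."""
    return {page: n for (text, page), n in anchor_counts.items() if text == anchor}


def most_common_sense(anchor):
    """Pick the meaning whose page receives the most links from this anchor."""
    candidates = senses(anchor)
    return max(candidates, key=candidates.get)


def sense_in_context(anchor, context):
    """Pick the meaning whose page shares the most links with the context.

    Ties fall back to the link count, i.e. the more popular meaning wins.
    """
    candidates = senses(anchor)
    return max(
        candidates,
        key=lambda page: (len(page_links[page] & context), candidates[page]),
    )


def synonyms(page):
    """Synonymy: all anchor texts that point to the same page."""
    return sorted(text for (text, target) in anchor_counts if target == page)


print(most_common_sense("plane"))                    # Airplane
print(sense_in_context("plane", {"wood", "chisel"}))  # Plane (tool)
print(synonyms("Airplane"))                          # ['aeroplane', 'airplane', 'plane']
```

With no useful context, `sense_in_context` gives the same answer as `most_common_sense`, which matches the idea of defaulting to the most popular usage.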

Overall, the semantic relatedness measures described in the paper performed fairly well, especially considering the simplicity of the algorithms used. I’m very interested in other ways in which information could be obtained from the network structure of Wikipedia.

http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf
