A Look at Wikipedia’s Ontology and its Underlying Connectivity
Tracking data and information in Wikipedia articles has been a popular topic for as long as the website has been around, and as long as the site continues to grow, the field will persist. Wikipedia even hosts a sub-community dedicated to studying its own articles and analyzing the information they contain [1,2]. In the frequent, multifaceted endeavors to ‘visualize’ Wikipedia, many have turned to a network-based approach. One such study considers how Wikipedia articles may function under a hierarchical structure [3]. By treating articles as nodes in a massive network graph, the study tackles article organization by grouping articles into ‘topics’. This idea, dubbed ‘Wikipedia Hierarchical Ontology’ or WHO by the study, works by assigning weights to the terms in an article based on their relevance to a concept, then comparing articles that give high weights to similar words. Using these weights, edges can be formed between articles to build up a more abstract topic that contains several article nodes. From there, topics are connected to other topics that fall under the same umbrella, which ultimately establishes a hierarchical ontology from the given dataset.
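To make the weighting idea concrete, here is a minimal Python sketch. It is not the study's actual method: it assumes plain TF-IDF as the term weight and cosine similarity as the comparison, and the article titles, texts, and the 0.3 threshold are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; titles and texts are invented for illustration.
articles = {
    "Graph theory": "graph node edge network path",
    "Network science": "network node edge cluster graph",
    "Baking": "flour oven bread yeast dough",
    "Sourdough": "bread yeast dough starter flour",
}

def tfidf_vectors(docs):
    """Return {title: {term: weight}} using plain TF-IDF (an assumed weighting)."""
    tokenized = {title: text.split() for title, text in docs.items()}
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for terms in tokenized.values() for term in set(terms))
    vectors = {}
    for title, terms in tokenized.items():
        counts = Counter(terms)
        vectors[title] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_edges(docs, threshold):
    """Return {(a, b): similarity} for article pairs at or above the threshold."""
    vectors = tfidf_vectors(docs)
    titles = sorted(vectors)
    edges = {}
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            score = cosine(vectors[a], vectors[b])
            if score >= threshold:
                edges[(a, b)] = score
    return edges

edges = similarity_edges(articles, threshold=0.3)
for pair, score in edges.items():
    print(pair, round(score, 3))
```

On this toy corpus the two graph-related articles are linked to each other, as are the two baking articles, while no edge crosses between the groups. Each linked pair would then be a candidate for a shared ‘topic’.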
Using this method, the study builds up a network of clusters, within which nodes share high-weighted edges. Nodes contained in a concept likely have many edges to other nearby nodes, since they fall under the umbrella of the same topic. Thus, it is conceivable that strong triadic closure would exist within such clusters, yet not between clusters (if edges are considered to exist only above a certain weight threshold). Identifying local bridges, edges whose endpoints share no common neighbors, is key to distinguishing articles that fall under the same concept from those that simply branch between concepts, with the latter being the case if a local bridge is found.
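The local-bridge test can be sketched in a few lines of Python. This is an illustrative example on a hypothetical graph of two triangles joined by a single edge, not data from the study; an edge is flagged as a local bridge when its two endpoints have no neighbor in common.

```python
def build_adjacency(edge_list):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def local_bridges(edge_list):
    """Return the edges whose endpoints share no common neighbor."""
    adjacency = build_adjacency(edge_list)
    return [(a, b) for a, b in edge_list if not (adjacency[a] & adjacency[b])]

# Hypothetical graph: two tightly knit 'concepts' joined by one edge.
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),  # first cluster (a closed triangle)
    ("D", "E"), ("E", "F"), ("D", "F"),  # second cluster (a closed triangle)
    ("C", "D"),                          # the only link between clusters
]

bridges = local_bridges(toy_edges)
print(bridges)
```

Here only the C-D edge is reported, since every other edge sits inside a closed triangle. In the Wikipedia setting, such an edge would mark a pair of articles that branch between concepts rather than belong to the same one.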
It is difficult to imagine a network of Wikipedia articles being disconnected, since articles trend towards broader, more encompassing topics as the subject is abstracted [4, 5]. However, the definition of connectivity comes into question in such a broad network, since what counts as an edge between articles can be unclear. By using term/concept similarity to define relations, the study creates a value that can be used to determine the presence of an edge. Since nearly any two articles will have a similarity above zero, a threshold on the magnitude of similarity is critical for establishing an edge. Even if edges are weighted, simply creating an edge for every nonzero value adds too much noise to the network for any hierarchy to be revealed. With this approach, the threshold value becomes paramount: choose a threshold too high and the graph can become disconnected; choose one too low and nearly every pair of articles becomes connected. The choice of threshold, however, is unclear without first defining what an ideal concept hierarchy should look like. Observing parameters of the network and tuning them appropriately would provide a means of discerning a proper hierarchy, but identifying which parameters to observe can itself get complicated. The study’s approach of simply grouping articles by concept and then connecting the concepts offers a roundabout solution, albeit at the cost of a noisier result.
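The threshold trade-off can be illustrated with a small Python sketch that counts edges and connected components as the cutoff rises. The node names, similarity values, and thresholds below are hypothetical and chosen only to show the effect; they do not come from the study.

```python
def count_edges(weighted_edges, threshold):
    """Count the edges whose weight is at or above the threshold."""
    return sum(1 for weight in weighted_edges.values() if weight >= threshold)

def count_components(nodes, weighted_edges, threshold):
    """Count connected components after dropping edges below the threshold."""
    adjacency = {node: set() for node in nodes}
    for (a, b), weight in weighted_edges.items():
        if weight >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components

# Hypothetical pairwise similarities between four articles.
nodes = ["A", "B", "C", "D"]
similarities = {
    ("A", "B"): 0.90, ("C", "D"): 0.80, ("B", "C"): 0.40,
    ("A", "C"): 0.10, ("B", "D"): 0.10, ("A", "D"): 0.05,
}

for threshold in (0.0, 0.3, 0.5, 0.95):
    print(
        threshold,
        count_edges(similarities, threshold),
        count_components(nodes, similarities, threshold),
    )
```

At a threshold of 0.0 every pair is linked and no structure is visible; at 0.5 the toy graph separates into two clusters; at 0.95 every article is isolated. Sweeping the cutoff like this is one simple way to observe how a network parameter responds to the threshold, though it does not by itself say which value yields the ‘right’ hierarchy.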
— Nikhil Saggi
Referenced works:
[1] https://en.wikipedia.org/wiki/Wikipedia:Researching_Wikipedia
[2] https://en.wikipedia.org/wiki/Wikipedia:Statistics
[3] https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6310552
[4] https://xefer.com/WIKIPEDIA
[5] https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy