The Deep Web
The Internet is like an iceberg: what we view and interact with daily is just the tip. A large portion of the Internet remains unseen and unsearchable by traditional means; ten years ago, estimates put this content at roughly 500 times the size of the public web! This “invisible” Internet is called the Deep Web. Technically speaking, the Deep Web is every part of the Internet that is not indexed by search engines (and is therefore not “searchable”). Its pages are not merely those left out of the “Strongly Connected Component” of the web graph; the Deep Web contains every site and page with no hyperlinks leading to or from it. This includes dynamic content (such as pages generated in response to search queries), private sites (which require a password), and media formats that search engine indices do not handle. Elements of the Deep Web are accessible only by knowing their direct address. In the early years of the Internet, this was the only way to reach a webpage; nowadays, search engines use automated processes to index websites and make them “searchable.”
There are multiple ways that search engines like Google, Yahoo!, and Bing locate and index sites. Google in particular has done extensive work on reaching content in the Deep Web. In 2005 it introduced the Sitemaps protocol, which lets webmasters publish a catalog of URLs to search engines for indexing. A Sitemap can attach tags to each URL indicating how often the page changes and how much relative priority to give it, which helps search engines decide how often to refresh their indexes and how to rank pages for searches (Google’s PageRank algorithm being the best-known example of such ranking).
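To make the protocol concrete, here is a minimal sketch of how a Sitemap file might be generated with Python’s standard library; the domain, URLs, and tag values are hypothetical placeholders, and a real Sitemap would normally be produced by a site’s publishing tools.

```python
# Minimal sketch: build a Sitemaps-protocol file (sitemap.xml) with the
# standard library. The URLs below are hypothetical placeholders.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"  # Sitemaps 0.9 namespace
ET.register_namespace("", NS)

urlset = ET.Element(f"{{{NS}}}urlset")
for loc, changefreq, priority in [
    ("https://example.com/", "daily", "1.0"),
    ("https://example.com/archive/page1", "monthly", "0.5"),
]:
    url = ET.SubElement(urlset, f"{{{NS}}}url")
    ET.SubElement(url, f"{{{NS}}}loc").text = loc
    # Optional hints: how often the page changes and its relative weight,
    # which crawlers may use when deciding how often to revisit it.
    ET.SubElement(url, f"{{{NS}}}changefreq").text = changefreq
    ET.SubElement(url, f"{{{NS}}}priority").text = priority

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```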
Search engines do not rely solely upon protocols and standards to reach URLs on the Internet; they depend largely upon automated programs called web crawlers. A crawler is a robot that works through the links of a website to categorize and archive its pages for indexing. Google’s aptly named “Googlebot,” for example, follows all the links on a webpage, then the links on those pages, and keeps scanning the HTML for new URLs to add to the search engine’s index. The problem with this method of discovering sites is that a page must be linked from some page the crawler already knows about, and most of the Deep Web exists unlinked.
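As a rough illustration of the idea (not Googlebot’s actual implementation), the sketch below crawls outward from a single seed URL using only Python’s standard library; the seed address is a hypothetical placeholder.

```python
# Minimal sketch of a breadth-first crawler: fetch a page, extract its
# links, and queue them for fetching in turn.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    seen, queue, index = {seed}, deque([seed]), []
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable page: skip it
        index.append(url)  # "archive" the URL (a real indexer stores content too)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# Pages never linked from any crawled page are never reached --
# that gap is, in essence, the Deep Web.
print(crawl("https://example.com/"))
```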
While most of the content on the Deep Web consists of unlinked webpages and private sites, a portion of the Internet is purposefully hidden. An anonymity network like Tor routes traffic through a network of computers to hide the IP addresses of users accessing “hidden” sites. This allows for a high degree of anonymity, which tends to attract illicit activity. Other anonymizing programs, such as Freenet, are used mainly by people in countries that suppress freedom of speech. Much of this hidden portion of the Deep Web, however, takes advantage of its relative obscurity to facilitate questionable behavior.
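For a sense of how access works in practice, the sketch below fetches a page through a locally running Tor client, assuming Tor is listening on its default SOCKS port (9050) and that the requests library is installed with SOCKS support (requests[socks]); the .onion address is a hypothetical placeholder, not a real site.

```python
# Minimal sketch: route an HTTP request through a local Tor client's
# SOCKS proxy so the destination never sees the user's IP address.
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # "socks5h" resolves hostnames through Tor

response = requests.get(
    "http://exampleonionaddress.onion/",  # hypothetical hidden-service address
    proxies={"http": TOR_PROXY, "https": TOR_PROXY},
    timeout=30,
)
print(response.status_code)
```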
While the Deep Web can be difficult to explore (sometimes purposefully so), the lines between the Surface Web and the Deep Web have begun to blur. Search engines keep trying to extend the reach of their indexes; Googlebot, for instance, can apparently now execute scripts on pages, allowing even more content to be indexed. Regardless, most of us have probably interacted with elements of the Deep Web without even knowing it, and far more remains undiscovered.