Google Can’t Search the Deep Web, So How Do Deep Web Search Engines Work?

Link: https://blog.torproject.org/tor-heart-ahmia-project

Introduction:

If search engines like Google, Yahoo, and Bing are unable to index the deep web, then how do deep web search engines work? I’ll try to answer this question by first defining what I mean by “deep web”, then explaining why Google can’t crawl the deep web, and finally looking at how two popular services, Ahmia and the Uncensored Hidden Wiki, “search” the deep web.

Deep Web:

According to Google’s online dictionary, the deep web is “the part of the World Wide Web that is not discoverable by means of standard search engines, including password-protected or dynamic pages and encrypted networks”. It is estimated that search engines like Google index only 4% of the entire World Wide Web, meaning that the deep web is roughly 24 times larger than the web you and I have used our whole lives. Note: the deep web shouldn’t be confused with the “dark web”, a small part of the deep web that lives on overlay networks such as Tor and can only be reached with special software. The dark web is often associated with illegal content such as child pornography, terrorist forums, and illegal auctions/transactions, although not everything hosted there is illegal.
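
To make the size comparison concrete, here is the arithmetic behind the 4% estimate, taking that figure at face value. The variable names are only for illustration.

```python
# Share of the web that standard search engines index, per the estimate above.
indexed_percent = 4
unindexed_percent = 100 - indexed_percent

# How many times larger than the indexed web is the whole web?
whole_vs_indexed = 100 / indexed_percent

# How many times larger than the indexed web is the unindexed (deep) part?
deep_vs_indexed = unindexed_percent / indexed_percent

print(whole_vs_indexed, deep_vs_indexed)
```

So the whole web is 25 times the size of the indexed part, and the deep web on its own is 24 times the size of the indexed part.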

Google Can’t Crawl the Deep Web:

Google’s search engine functions by using “crawlers” (1). These crawlers start from a list of known web addresses, visit those pages, follow the links contained on those pages, and continue following links found on the new pages, collecting information about each page as they go. Now, consider a single page in the deep web. Google’s search engine could be unable to find this page for several reasons. For one, Google’s crawlers might never come across this page simply because no previously crawled page links to it. Additionally, the page might sit behind some sort of gate, such as a search form that has to be filled out and submitted, a login, or a required certificate. Also, if a page contains illegal content, Google will likely not want that content appearing in search results, so it won’t index it. Finally, if the creator of a page doesn’t want it to be indexed by popular search engines, they can include a suitable robots.txt file, which asks crawlers not to visit the page. Compliance is voluntary, but a crawler that ignores the file risks being blocked, may in some cases expose its operator to legal action, and can end up on a bot reporting site like http://www.botreports.com/badbots/ (2).
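
The crawling process described above can be sketched in a few lines of Python. This is only a toy model, not how Google’s crawler is actually built: the “web” is a hypothetical in-memory dictionary, the hostnames and the `sketch-bot` user agent are made up, and the robots.txt rules are supplied as plain lists instead of being downloaded. The only real library call is the standard library’s `urllib.robotparser`.

```python
from collections import deque
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/private/report"],
    "http://b.example/private/report": [],
    "http://c.example/orphan": [],  # exists, but no page links to it
}

# Hypothetical robots.txt contents, keyed by host.
ROBOTS = {
    "b.example": ["User-agent: *", "Disallow: /private/"],
}


def allowed(url, agent="sketch-bot"):
    """Check the (hypothetical) robots.txt rules for the URL's host."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS.get(urlsplit(url).netloc, []))
    return parser.can_fetch(agent, url)


def crawl(seeds):
    """Breadth-first crawl: visit seeds, then follow links page by page."""
    seen = set(seeds)
    queue = deque(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        if url not in PAGES or not allowed(url):
            continue  # unreachable page, or the site asked not to be crawled
        indexed.append(url)
        for link in PAGES[url]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


indexed = crawl(["http://a.example/"])
print(indexed)
```

Starting from the single seed, the sketch indexes the two pages on a.example and the front page of b.example. It skips the page under /private/ because of the robots.txt rule, and it never discovers the orphan page on c.example because nothing links to it. Those are two of the reasons given above for why a page ends up in the deep web.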

“Deep Web Search Engines”:

There are many “deep web search engines”, but I’ll focus on two: Ahmia, and the Uncensored Hidden Wiki.

Ahmia was developed by Juha Nurmi as part of the Tor Project, and it is one of the closest things to a deep web search engine (3). Ahmia essentially collects .onion URLs from the Tor network, then adds those pages to its index, provided that the site’s robots.txt file doesn’t ask for them to be left out (4). Additionally, Ahmia allows onion service operators to register their own URLs so that they can be found. By continuously collecting .onion URLs, Ahmia has built one of the largest indexes of the deep web. That being said, it still barely scratches the surface of the whole deep web, but it indexes a good portion of the content that most people would want to look for.
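
As a rough illustration of the collect-filter-register idea, here is a minimal Python sketch. It is not Ahmia’s actual code: the function names, the sample text, and the placeholder addresses are all made up, and the opt-out set stands in for whatever robots.txt check the real system performs. The one fact it relies on is that onion hostnames are base32 strings of 16 characters (older v2 addresses) or 56 characters (v3 addresses).

```python
import re

# Onion hostnames are base32 (a-z, 2-7): 16 characters for v2, 56 for v3.
ONION_RE = re.compile(r"\b((?:[a-z2-7]{56}|[a-z2-7]{16})\.onion)\b")


def extract_onions(text):
    """Return unique .onion hostnames in text, in order of first appearance."""
    found = []
    for host in ONION_RE.findall(text.lower()):
        if host not in found:
            found.append(host)
    return found


def build_index(crawled_text, registered, opted_out):
    """Combine discovered and operator-registered hosts, minus opt-outs."""
    index = []
    for host in extract_onions(crawled_text) + list(registered):
        if host not in opted_out and host not in index:
            index.append(host)
    return index


# Placeholder addresses, not real services.
V2 = "a" * 16 + ".onion"
V3 = "b" * 56 + ".onion"
REGISTERED = "c" * 16 + ".onion"

text = f"Mirror at http://{V2}/ and also {V3}. Not valid: short.onion"
index = build_index(text, registered=[REGISTERED], opted_out={V3})
print(index)
```

Here the two well-formed addresses are discovered in the text, the malformed one is ignored, the operator-registered address is added, and the host that opted out is dropped from the final index.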

The Uncensored Hidden Wiki operates a little differently. Anyone can register on the Uncensored Hidden Wiki, and after that, they can edit the links contained in the database. The search engine operates by searching the user-provided descriptions of the pages at these links. This certainly has its pros and cons. On the bright side, crowd-sourcing the links is one of the best ways to collect a large number of useful URLs and keep them up to date (especially since .onion addresses change extremely often). On the other hand, anyone can point the links wherever they want, or alter the descriptions of the links. Site admins can mitigate this so that the links are usually accurate, but there are no guarantees when using the links on this page. Additionally, the Uncensored Hidden Wiki has its name for a reason, as the content of that page is certainly uncensored.
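
Searching descriptions instead of page contents can be sketched as a simple keyword match. This is a guess at the general idea, not the wiki’s real implementation: the entries, the placeholder addresses, and the scoring rule (count how many query words appear in the description) are all assumptions made for illustration.

```python
# Hypothetical wiki entries: link -> user-supplied description.
MAIL = "http://" + "a" * 16 + ".onion/"
FORUM = "http://" + "b" * 16 + ".onion/"
GUIDES = "http://" + "c" * 16 + ".onion/"

ENTRIES = {
    MAIL: "Anonymous email service",
    FORUM: "Privacy discussion forum",
    GUIDES: "Guides to email privacy",
}


def search(query, entries):
    """Rank links by how many query words appear in their description."""
    terms = query.lower().split()
    scored = []
    for link, description in entries.items():
        text = description.lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, link))
    # Highest score first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: -pair[0])
    return [link for _, link in scored]


results = search("email privacy", ENTRIES)
print(results)
```

Because only the descriptions are searched, the results are exactly as trustworthy as whoever last edited them: changing a description changes where a link shows up, even if the page behind it is unchanged.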

Conclusion:

While the “deep web search engines” mentioned above are capable of indexing a good part of the deep web, the vast majority of it remains unindexed, and no search engine is capable of finding everything contained in it. The best deep web search engines function in various ways, whether by crowd-sourcing URLs and page descriptions or by continuously collecting them, but they certainly do not work the way traditional search engines such as Google do. If you want to learn more about the deep web, you can find plenty of information about it using Google (how ironic). If you want to search the deep web yourself, here’s my advice: don’t, especially if you’ve never heard of terms like .onion, Tor gateways, proxies, botnets, Trojans, etc. If you’re anything like me, then you have no business searching the deep web, as it can be dangerous if you’re not extremely careful about protecting your identity, even when using the search engines mentioned above in conjunction with the Tor browser. The deep web also has very little that you or I would find interesting, and plenty of things that neither you nor I want to see.

Sources:

  1. https://www.google.com/search/howsearchworks/crawling-indexing/
  2. http://www.botreports.com/badbots/
  3. https://blog.torproject.org/tor-heart-ahmia-project
  4. https://ahmia.fi/documentation/indexing/
