


A Webpage for Every (Other) Person

The Internet as we know it is a massive sea of interconnected webpages. While there are about 7 billion people on Earth, there are approximately 4 billion webpages online. One organization, Common Crawl, allows anyone to peruse the web systematically. As we usually experience it, browsing means starting from one webpage (such as this very blog) and jumping to another via a link. Each site has its own assortment of subpages with their own data and information. The Internet, this sea of nodes, is not fully connected at any given time: a new webpage will not be referenced on other sites until it is discovered and a link to it is shared. Even so, the network of sites is tied together by web search services such as Google and Bing. These companies build the search engines that are crucial to everyday traversal of the Internet through a process called crawling the web. Crawlers, computers dedicated to constantly browsing the web, look for new websites and for changes to existing ones. Once that data is collected, it can be searched.
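To make the idea of crawling concrete, here is a minimal sketch in Python. It is not how Google, Bing, or Common Crawl actually crawl; it only illustrates the basic loop of visiting a page, extracting its links, and queueing any page not seen before. The tiny in-memory "web" (TOY_WEB), the example addresses, and the names LinkExtractor, extract_links, and crawl are all made up for illustration, and a real crawler would fetch pages over the network instead of reading them from a dictionary.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical miniature "web": address -> HTML body.
# A real crawler would download each page over HTTP instead.
TOY_WEB = {
    "a.example/": '<a href="b.example/">B</a> <a href="c.example/">C</a>',
    "b.example/": '<a href="c.example/">C</a>',
    "c.example/": '<a href="a.example/">A</a>',
    "d.example/": '<a href="a.example/">A</a>',  # no other page links here yet
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def crawl(seed, web):
    """Breadth-first crawl: returns pages in the order they were visited."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:  # page does not exist in our toy web
            continue
        visited.append(url)
        for link in extract_links(html):
            if link not in seen:  # only queue pages we have not met before
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("a.example/", TOY_WEB))
# prints ['a.example/', 'b.example/', 'c.example/']
```

Notice that d.example/ never shows up in the result: nothing links to it, so a crawler starting from a.example/ has no way to discover it. That is the "not fully connected" situation described above.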

 

Common Crawl lets anyone with a computer and an Internet connection analyse the information on the web without dedicating time and effort to building crawlers and storing the massive amounts of data they collect. Common Crawl houses its data on Amazon’s storage servers, where anyone can access it. Anyone who wishes to make their own search engine, collect data from websites, or statistically analyse large sets of webpages can therefore do so without the massive amount of time and money that building crawlers requires. Google, for example, has a translation service that was created by analysing massive amounts of text available online in different languages, in order to predict what a sentence in one language might equate to in another. The data set that the non-profit Common Crawl offers could lead to a breakthrough in information technology.
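As a rough illustration of what "making your own search engine" from already-crawled pages might involve, here is a minimal Python sketch of an inverted index, a table that maps each word to the pages containing it. The page texts in PAGES and the names tokenize, build_index, and search are invented for this example; it does not read real Common Crawl data, and a real search engine would also rank its results rather than just list matches.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for crawled data: address -> page text.
PAGES = {
    "a.example/": "Crawlers browse the web looking for new pages",
    "b.example/": "Search engines answer queries from crawled pages",
    "c.example/": "Translation models learn from text on the web",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


index = build_index(PAGES)
print(search(index, "web"))            # prints ['a.example/', 'c.example/']
print(search(index, "crawled pages"))  # prints ['b.example/']
```

The expensive part, collecting the pages in the first place, is exactly what a shared crawl saves you from doing yourself.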

 

 

 

http://commoncrawl.org/

http://www.technologyreview.com/news/509931/a-free-database-of-the-entire-web-may-spawn-the-next-google/
