All Things are Relative on the Internet

Last week, I offered to explain how the Google Collection relates to the Index (or, rather, vice versa) and Karen took me up on it.

First, I’m going to refer you to last week’s post, where I gave definitions for the terms “collection” and “index”. I know they seem like simple words, but I can assure you that there has been much confusion and miscommunication surrounding them during the course of this project.

Briefly, then:

Collection = all pages that Google can find in our domain (except the ones we told it not to find).

Index = the (approximately) 5 million pages that are searched and returned when a user enters a query.

So. The Index is a subset of the Collection; that’s pretty easy.

Here’s the $64,000 question:

How does Google figure out what’s in the Index?

The short answer is “rank”. Google ranks every page in the Collection and then sifts the top 5 million of them into the Index.

Google is constantly monitoring both the Collection and the Index and will bump pages in either direction (into or out of the Index) as their rank dictates.
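To make the “top N by rank” idea concrete, here is a minimal sketch in Python. It is only an illustration: the real mechanism is proprietary, the rank scores and page paths below are made up, and `build_index` is a hypothetical helper, not anything the Google appliance exposes.

```python
import heapq

def build_index(page_ranks, index_size):
    """Return the set of URLs whose rank puts them in the top `index_size`.

    `page_ranks` maps each URL in the Collection to a made-up rank score.
    """
    top = heapq.nlargest(index_size, page_ranks.items(), key=lambda item: item[1])
    return {url for url, _rank in top}

# A toy Collection of four pages with invented scores.
collection = {
    "/": 0.9,
    "/admissions": 0.7,
    "/old-newsletter": 0.1,
    "/physics": 0.6,
}

# Pretend the Index only has room for two pages instead of 5 million.
index = build_index(collection, index_size=2)

# When a page's rank rises, it is bumped into the Index and another page drops out.
collection["/physics"] = 0.8
index_after = build_index(collection, index_size=2)
```

In this toy run the Index starts as the home page plus `/admissions`; once `/physics` outranks `/admissions`, the two swap places.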

Were this late-night TV of yore, now would be the point where I would don a turban and hold an envelope up to my forehead in an attempt to divine the future.

The answer is: Proprietary information!

The question was: How does Google determine rank?

Damn, I’m good.

Right. So Google is very tetchy about sharing the details of its ranking system. (If you are interested in the technical specifications of the Google page ranking algorithm, visit http://www.google.com/corporate/tech.html or http://www.whitelines.nl/html/google-page-rank.html.) It basically boils down to two things:

1. How many pages link to you.
2. The rank of those pages.
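Those two factors can be sketched with a simplified version of the publicly described PageRank idea. This is not Google’s actual ranking system, which uses many more signals. The function name `simple_rank`, the four-page link graph, and the iteration count are all assumptions for illustration; the 0.85 damping factor is the value commonly quoted for the published algorithm.

```python
def simple_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank: `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Toy link graph: "a" and "b" each have exactly one inbound link,
# but "a" is linked from the well-ranked "c" and "b" from the obscure "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["b"]}
ranks = simple_rank(toy_links)
```

Here `c` comes out on top because two pages link to it (factor 1), and `a` beats `b` even though each has a single inbound link, because `a`’s link comes from a higher-ranked page (factor 2).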

There’s more to it than that, of course, and you can find out what you can do to help improve your rank at: http://web.cornell.edu/resources/google_help/rank.html.

Can we control what’s in the Index?

No, not directly. Not to the best of my knowledge, anyway. And, I can assure you, this is exceedingly frustrating for me.

To some extent, we are “controlling” it as we fine-tune the Collection by dropping out page hogs like databases and session IDs, but that’s really all we can do. You can help by tuning your site to boost your ranking, but we are largely at the mercy of the Google Collective.
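For the curious, the kind of filtering described above might look something like the following sketch. It is plain Python, not the appliance’s own configuration syntax, and the parameter names, path prefixes, and `example.edu` URLs are all hypothetical stand-ins for whatever rules we actually use.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical query parameters that usually signal a session ID.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

# Hypothetical path prefixes for database-driven "page hogs".
EXCLUDED_PREFIXES = ("/db/", "/search/")

def keep_in_collection(url):
    """Return True if the URL should stay in the Collection."""
    parts = urlsplit(url)
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return False
    params = parse_qs(parts.query, keep_blank_values=True)
    # Drop any URL that carries a session-style parameter.
    return not any(name.lower() in SESSION_PARAMS for name in params)

candidates = [
    "http://www.example.edu/about.html",
    "http://www.example.edu/db/record?id=5",
    "http://www.example.edu/page?PHPSESSID=abc123",
]
kept = [url for url in candidates if keep_in_collection(url)]
```

Only the first URL survives: the second lives under an excluded database path, and the third would otherwise produce a separate copy of the same page for every visitor session.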

Resistance is Futile,

Lisa

Comments

2 Responses to “All Things are Relative on the Internet”

  • Will

    You can map specific search terms to specific urls, right? (I think they call it keyword matching or something like that). Like if I enter “physics” you can preset the Google appliance to make http://www.physics.cornell.edu/ show up first in the rankings.

  • anonymous coward

    Why not use the “Sponsored Links” function and let administrative departments pay for priority in the rankings? Even better, we could charge each department a participation fee plus a Google Usage Based Billing (or “GUBB”) fee for each of its pages indexed by the Google Appliance (beyond a base of, say, 500). There could also be a charge every time a department’s page appeared in a list of search results and then an additional fee for every click-through. How ’bout it?