Network Effects in Search Engine Feedback Data

“Search Engines that Learn from Implicit Feedback” summarizes research by Thorsten Joachims and Filip Radlinski at Cornell University on how user behavior can be used to improve search results.  Implicit feedback uses the actions of users to modify the order and contents of search results.  Search algorithms like PageRank don’t always deliver the desired results, and if users prefer a different page, search engines can take advantage of behavioral data to provide better results to future users.

For example, if a user submits a query to Google and chooses the 4th result, that particular result has a higher utility for that user.  If multiple users choose the same result, there’s a good chance that the 4th result is actually the best result on the page.  Joachims and Radlinski mention many more types of data collection, such as query chaining.
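This kind of click aggregation can be sketched in a few lines. The log format below is purely hypothetical (the paper does not specify one); the idea is just that results clicked more often for a query float upward:

```python
from collections import Counter

def rerank_by_clicks(query, results, click_log):
    """Reorder a result list so that results clicked more often for
    this query move toward the top.  The log is a hypothetical list
    of (query, clicked_url) pairs."""
    counts = Counter(url for q, url in click_log if q == query)
    # Stable sort: pages with equal click counts keep the engine's
    # original ordering, so feedback only reorders where it has data.
    return sorted(results, key=lambda url: -counts[url])

log = [("oed", "d.com"), ("oed", "d.com"), ("oed", "a.com")]
print(rerank_by_clicks("oed", ["a.com", "b.com", "c.com", "d.com"], log))
# → ['d.com', 'a.com', 'b.com', 'c.com']
```

Here the 4th result, clicked twice, is promoted to the top, while the unclicked results keep their relative order.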

Suppose a user submits a query and then doesn’t click on any of the results.  This is called abandonment.  Query chaining occurs when the user then reformulates the query into something more precise or slightly different: adding words, swapping a search term for a synonym, or changing the query entirely.  The user was not satisfied with the first result page, and the sequence of queries used to find the right result is called a query chain.  When the user eventually clicks on a result, they were probably looking for that result when they submitted the first query.  The search engine would be more efficient if that result appeared for the first query, since only one query would have to be submitted.  If multiple users make the same initial query and click the same eventual result, that result is likely a good candidate for the first query.  Joachims and Radlinski use the Oxford English Dictionary as an example: a user searched “oed,” and when the correct result didn’t appear, searched “Oxford English Dictionary.”  The later query was more precise and returned the correct result, but the search engine could have saved the user time by ranking the dictionary highly for the first query.
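The crediting step of query chaining can be sketched directly. The session format here is an assumption for illustration (a list of query events, with `None` marking an abandoned query): the eventual click is credited back to the first query in the chain.

```python
def credit_query_chains(sessions):
    """For each session (a hypothetical list of (query, clicked_url)
    events, where clicked_url is None for an abandoned query), credit
    the eventually clicked result back to the FIRST query in the
    chain, so it can be promoted for future users."""
    credits = []
    for events in sessions:
        first_query = events[0][0]
        for query, clicked in events:
            if clicked is not None:
                credits.append((first_query, clicked))
                break  # only the first satisfying click ends the chain
    return credits

sessions = [[("oed", None), ("Oxford English Dictionary", "oed.com")]]
print(credit_query_chains(sessions))  # → [('oed', 'oed.com')]
```

Aggregating these (first query, result) pairs across many users gives exactly the signal described above: a result that repeatedly ends chains starting from “oed” is a strong candidate for the “oed” results page.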

Joachims and Radlinski didn’t just theorize about this type of data collection, but actually used it to create a “metasearch engine,” which kept track of user clicks and query chains to improve results from search engines like Google and MSN.  They created a baseline function that assigned values to certain pages, and it was adjusted based on a large number of training queries.  Joachims and Radlinski tried their new engine in two places: the Artificial Intelligence unit at the University of Dortmund, and the library system at Cornell University.  Users at Dortmund were aware that the search engine had been modified, but those at the Cornell libraries were not.  Once the search engine “learned” how to produce better results, both groups of users preferred the improved metasearch engine.
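The “adjusted based on training queries” step is a learning-to-rank problem. Joachims’ work is associated with SVM-style learners trained on preference pairs; the sketch below substitutes a much simpler pairwise perceptron over made-up feature vectors, just to show the shape of such an update:

```python
def train_pairwise(prefs, n_features, epochs=10, lr=0.1):
    """Minimal pairwise-perceptron sketch: each preference is a pair
    (features_of_preferred_doc, features_of_other_doc), e.g. derived
    from a click on a lower-ranked result.  Whenever the preferred
    document does not score higher, nudge the weights toward it.
    (Illustrative only; the actual system used an SVM-style learner.)"""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in prefs:
            margin = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            if margin <= 0:  # preference currently violated
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w

def score(w, features):
    return sum(wi * f for wi, f in zip(w, features))

# One hypothetical preference: the doc with feature 0 set was clicked.
w = train_pairwise([([1.0, 0.0], [0.0, 1.0])], n_features=2)
```

After training, `score(w, [1.0, 0.0])` exceeds `score(w, [0.0, 1.0])`, i.e. the ranking function now agrees with the observed click preference.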

Implemented on a large scale, capturing and using implicit feedback data can significantly improve search results for groups of users.  Although PageRank is a popular and well-studied search algorithm, quite a few improvements have been made to the basic model.  The Google search algorithm has been improved not only to provide more relevant search results, but also to thwart spammers and those looking to manipulate page rankings.  Taking advantage of implicit data like query chains and click-throughs considers popularity from a human standpoint instead of hyperlink or node popularity.

The advantages of implicit feedback data can also be analyzed from a network effect standpoint.  When choosing what search results to display, a search engine benefits from “following the crowd.”  Thousands of users may believe a certain result is best for a given query, even though the PageRank data and hyperlinking structure point to a different best result.  A search engine is meant to cater to its users, and therefore should change its results based on the decisions of the majority.  This may upset some companies that have high PageRanks and pay advertising fees, but Google also has to remain competitive as a search engine: if users are dissatisfied with search results, they’ll choose other sites.  We can think of these feedback techniques as the search engine ignoring its own signal and following the group, since the group knows something the search engine doesn’t.  This is the foundation of an information-based network effect.

Other network effects can actually counteract the advantages of implicit feedback data.  If a majority of users click a certain result for the same query, it is generally safe for the search engine to “follow the crowd.”  But a natural question follows from this idea: what if the “crowd” is wrong?  Joachims and Radlinski found that users view the top search result first and then scan the rest in order down the page, a natural assumption that is also verified by eye-tracking research they cite.  This makes top results more likely to be clicked: if users think most results are similar, they’ll click the first one out of convenience.  The effect is amplified by Google’s “I’m Feeling Lucky” feature, which automatically chooses the first result, and by the suggestions displayed in the Chrome search bar.  In newer versions of Chrome, results for search terms and select URLs are displayed below the browser bar as a query is typed, but only a few top results are shown.  So even if a lower-ranked page is a better result, its clicks are diluted by clicks on higher results from users relying on these two features.
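One common way to reason about this position bias (an illustrative technique, not something the post or necessarily the paper prescribes) is to weight each rank’s clicks by the inverse of the estimated probability that users even examine that rank. The examination probabilities below are made-up numbers:

```python
def debiased_click_scores(clicks_by_rank, examine_prob_by_rank):
    """Inverse-propensity weighting sketch: divide the observed click
    count at each rank by the (assumed, made-up) probability that a
    user examines that rank at all.  Lower ranks get fewer raw clicks
    partly because fewer users ever look at them."""
    return {rank: clicks / examine_prob_by_rank[rank]
            for rank, clicks in clicks_by_rank.items()}

# Rank 3 received fewer raw clicks than rank 1, but after correcting
# for how rarely users look that far down, it scores higher.
scores = debiased_click_scores({1: 100, 3: 40}, {1: 1.0, 3: 0.3})
print(scores)  # → {1: 100.0, 3: 133.33...}
```

Under these assumed numbers, the rank-3 result is actually preferred by the users who saw both, which is exactly the signal raw click counts would hide.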

This implicit feedback system could also be manipulated by spammers and others looking to game page rankings.  Companies are constantly trying to raise their rankings on search sites and increase traffic to their pages, and under an implicit feedback system, a large number of clickthroughs would do exactly that.  If a spammer wanted to advance a page up the rankings, they could send queries and clickthroughs from botnets to feed the engine false implicit feedback data.  Although the practicality of doing this is arguable, and Google search takes hundreds of other variables into account to limit the influence of spammers, the use of implicit feedback data creates more work for security teams and another avenue for manipulating search results.

A well-known disadvantage of PageRank is that newer pages are very unlikely to come up in search results.  These new pages have very few incoming links, and are therefore ranked much lower on the PageRank scale.  This problem is actually compounded by implicit feedback data usage.  When a query is submitted to the engine, older, well-established pages appear higher in the results and have already accumulated more clicks over a longer period of time.  Under implicit feedback, those clicks raise them even higher in the results.  In this way, the high search results get higher, the lower results stay low, and new pages become even less likely to be clicked.  In terms of page ranking, the rich get richer and the poor get poorer, a common result of network effects.
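This rich-get-richer dynamic can be demonstrated with a toy simulation (my own illustration, not from the paper): each simulated user clicks a page with probability proportional to its accumulated clicks, so early leaders tend to stay leaders.

```python
import random

def simulate_click_feedback(n_pages=5, n_users=1000, seed=0):
    """Toy Polya-urn-style dynamic: each user clicks one page with
    probability proportional to its click count so far.  Every page
    starts with 1 pseudo-click so new pages are not impossible to
    click, just unlikely."""
    random.seed(seed)
    clicks = [1] * n_pages
    for _ in range(n_users):
        r = random.uniform(0, sum(clicks))
        for page in range(n_pages):
            r -= clicks[page]
            if r <= 0:
                clicks[page] += 1  # the click makes this page likelier next time
                break
    return clicks

print(simulate_click_feedback())
```

Running this typically yields a very uneven final distribution: a page that happens to attract a few early clicks snowballs, while the others stagnate, which is the feedback loop described above.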

As gatekeepers of the internet, search companies will likely keep improving their search algorithms and techniques to bring better search results to users.  Even though implicit data collection is already being implemented in Google Search, the way this data is collected, filtered, and used to edit search rankings can always be improved.


-Castor Troy

Source:

http://luci.ics.uci.edu/websiteContent/weAreLuci/biographies/faculty/djp3/LocalCopy/04292009.pdf
