Article Summary for Lecture #10—Northedge

In “Google and beyond: information retrieval on the World Wide Web,” Richard Northedge examines the methods and problems involved in indexing web pages. He provides a brief history of approaches to online information retrieval, describing how the various methods developed and evolved, and offering useful descriptions of both human-generated web directories and automated search engines. Northedge notes that in the early days of the World Wide Web, directories such as Yahoo! offered adequate lists of web pages. However, the growth of the web quickly outpaced the ability of humans to index all the information it contained, which, according to Northedge, led to increased use of search engines. Despite this prominent trend, online information retrieval had not become fully automated at the time Northedge was writing and still featured a number of human-driven alternatives. Rather than anticipating the end of human efforts within this process, Northedge envisions a future of online information retrieval dominated by computer-generated indexes operating with datasets and language standards provided by humans.

Northedge cites Clay Shirky to point out the “failure of traditional human classification techniques when applied to the web.” Human indexing, Northedge continues, works best within a small corpus, with a fixed and unchanging text, featuring clearly defined categories and a controlled vocabulary. (192) The rapid and broad expansion of the web, consequently, made it an environment unsuitable for human-indexing methods. Search engines, such as the one operated by Google, provide a more appropriate means of indexing the billions of available web pages. Instead of scanning each web page in its entirety, explains Northedge, search engines employ a software program known as a “spider” or “robot” that continuously scans web pages, analyzes their content, and stores certain aspects of that content in databases. The particular algorithms used to collect and store the information gleaned from the web pages are often proprietary, as is the case with Google. It is, nevertheless, understood that information is collected from web pages and stored as metadata that is, in turn, scanned by the search engine when users perform searches in order to retrieve relevant results. (193)
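
The general idea of collecting content ahead of time and then searching the stored data, rather than the pages themselves, can be illustrated with a minimal sketch of an inverted index. This is not Google's method, which Northedge notes is proprietary, and it is not taken from the article. The page texts, the example.com URLs, and the function names build_index and search are all assumptions made up for illustration, and the spider's fetching step is replaced by a small hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical stand-in for pages a spider has already fetched (URL -> text).
PAGES = {
    "http://example.com/a": "Human indexers build web directories",
    "http://example.com/b": "Search engines index web pages automatically",
    "http://example.com/c": "Directories list pages by category",
}


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # Start from the pages matching the first word, then narrow the set.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(PAGES)
print(search(INDEX, "web pages"))   # only page b contains both words
print(search(INDEX, "directory"))   # no page contains this exact word
```

In this sketch the query "directory" returns nothing even though two pages mention "directories," because matching is done on exact words. That is a small instance of the language-variance problem that word-based systems face.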

While this process has become standard, Northedge notes several alternatives to the use of automated algorithms in indexing web content. One approach is to allow the creators of web pages to assign their own subject keywords, which is not a practical solution because of the potential for abuse of the system. A more widely accepted trend, says Northedge, is the assigning of keywords by the users of the web pages in what has been called “tagging” or “folksonomy.” Variability in language use, however, is perhaps the biggest drawback to this approach. Because of such limitations, many have given up on looking for a way to index the web in its entirety and, instead, advocate indexing only those resources deemed “high-quality.” Northedge points out that others are focusing their efforts on ways to overcome the language variance problems of search engines; examples include shifting from word-based to “lexeme”-based systems and David Crystal’s taxonomic database, Textonomy. (194)

The strengths of Northedge’s article lie in his concise but useful description of search engines and his examination of these tools against the problems presented by indexing web content. Given inexhaustible resources, humans would still be better indexers than automated systems. However, time and money are limited, so computers must be involved in this process. As such, Northedge’s assertion that the future will consist of computer-generated indexing controlled by human-driven datasets and language standards seems a reasonable supposition. (194) Meanwhile, the web continues to grow, but the technological systems that index it are becoming more sophisticated along with it. The next decade could hold unexpected shifts within the world of web page indexing.


