That was around April or May of 1993. About that same time, www.mit.edu went online as one of the first 100 web servers in the world. Naturally, this was not MIT's official homepage(2), because at that point, nobody had homepages. It was actually a server set up by a bunch of students who collectively called themselves SIPB (the Student Information Processing Board). Their pages offered a central starting place for exploring MIT's web sites, providing helpful information for "surfers" who were still confused by the whole concept.(3)
Back in 1990, there was no World Wide Web. Around this time, Tim Berners-Lee probably had a bad dream in which a scary monster with "HTTP" etched into its hide slowly ate up all of the Earth's resources. Nonetheless, there was still an Internet, and many files were scattered all over the vast network.
The primary method of storing and retrieving files was the File Transfer Protocol (FTP). This was (and still is) a system that specifies a common way for computers to exchange files over the Internet. It works like this: Some administrator decides that he wants to make files available from his computer. He sets up a program on his computer, called an FTP server. When someone on the Internet wants to retrieve a file from this computer, he or she connects to it via another program called an FTP client. Any FTP client program can connect with any FTP server program as long as the client and server both fully follow the specifications set forth in the FTP protocol.
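For the curious, here is a rough sketch of the client side of that exchange, written in Python using its standard ftplib module. The host name, directory, and file name are placeholders rather than a real archive.

```python
# A minimal sketch of an FTP client session; the host and file names
# below are placeholders, not a real archive.
from ftplib import FTP

ftp = FTP("ftp.example.org")   # connect to the FTP server
ftp.login()                    # anonymous login (no account needed)
ftp.cwd("/pub")                # move to the public file area
for name in ftp.nlst():        # ask the server for a file listing
    print(name)
with open("README", "wb") as fh:
    ftp.retrbinary("RETR README", fh.write)   # download one file
ftp.quit()
```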
Initially, anyone who wanted to share a file had to set up an FTP server in order to make the file available to others. Later, "anonymous" FTP sites became repositories for files, allowing all users to post and retrieve them.
Even with archive sites, many important files were still scattered on small FTP servers. Unfortunately, these files could be located only by the Internet equivalent of word of mouth: Somebody would post an e-mail to a message list or a discussion forum announcing the availability of a file.
Archie changed all that. It combined a script-based data gatherer, which fetched site listings of anonymous FTP files, with a regular expression matcher for retrieving file names matching a user query. (4) In other words, Archie's gatherer scoured FTP sites across the Internet and indexed all of the files it found. Its regular expression matcher provided users with access to its database.
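As a rough illustration of the matching half of that design, the sketch below runs a regular expression over a small, made-up index of file names and the sites they were gathered from. It is only a toy, not Archie's actual code.

```python
import re

# A toy stand-in for Archie's gathered index: file names mapped to the
# anonymous FTP sites they were found on (all entries are invented).
index = {
    "xmosaic-2.4.tar.gz": "ftp.example.edu",
    "pkzip204g.exe":      "ftp.example.com",
    "rfc959.txt":         "ftp.example.org",
}

def search(pattern):
    """Return (file, site) pairs whose file name matches the pattern."""
    query = re.compile(pattern)
    return [(name, site) for name, site in index.items() if query.search(name)]

print(search(r"rfc\d+"))   # e.g. [('rfc959.txt', 'ftp.example.org')]
```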
Matthew Gray's Wanderer created quite a controversy at the time, partially because early versions of the software ran rampant through the Net and caused a noticeable netwide performance degradation. This degradation occurred because the Wanderer would access the same page hundreds of times a day. The Wanderer soon mended its ways, but the controversy over whether robots were good or bad for the Internet remained.
The term robot has special significance to programmers. Their version of the term is mostly unrelated to the lumbering metallic creatures of Asimov lore. A synonym for robot, "automaton," is actually more enlightening. Computer robots are programs that automatically perform a repetitive task at speeds that would be impossible for humans to match, just like the tasks today's robots perform in factories.
On the Internet, the term robot or bot has become a bit broader. For the most part, it refers to programs that explore the Internet for some sort of information. Web robots search the Internet for web pages, usually for the purpose of compiling a large, searchable database. This category of robot is often called a spider. The spider robot falls right into the standard definition of performing a repetitive task.
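To make the repetitive task concrete, here is a bare-bones sketch of what a spider does, written in Python with only the standard library: fetch a page, pull out its links, add them to a queue, and repeat. The starting URL is a placeholder, and a real spider would add politeness rules, robust error handling, and an index.

```python
from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(start_url, limit=10):
    queue, seen = [start_url], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                      # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("http://www.example.com/"))   # placeholder start page
```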
Other types of robots on the Internet stretch this definition of an automated task. The chatterbot variety is a perfect example. These robots are designed to communicate with humans about some topic in a human-like manner. Some of them are fairly convincing; others are obviously just quickly written programs. Chatterbots are sometimes used as an intuitive way to communicate certain basic information to users. An example is the milk robot, which can answer lots of questions about milk. One could force this type of program into the definition above by saying that it performs the repetitive task of communicating with clueless people.
Unfortunately, the disadvantages of ALIWEB are more of a problem today. The primary disadvantage is that a special indexing file must be submitted. Most users do not understand how to create such a file, and therefore they don't submit their pages. This leads to a relatively small database, which means that users are less likely to search ALIWEB than one of the large bot-based sites. This Catch-22 has been somewhat offset by incorporating other databases into the ALIWEB search, but it still does not have the mass appeal of search engines such as Yahoo! or Lycos.
This process caused a great deal of controversy because some poorly written spiders were creating huge loads on the network by repeatedly accessing the same series of pages. Most network administrators thought they were a bad thing, so naturally programmers created even more of them.
By December 1993, the web had a case of the creepy crawlies. Three search engines powered by robots had made their debut: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider.
JumpStation's web bot gathered information about the title and header from web pages and used a very simple search and retrieval system for its web interface. The system searched a database linearly, matching keywords as it went. Needless to say, as the web grew larger, JumpStation became slower and slower, finally grinding to a halt.
The WWW Worm indexed only the titles and URLs of the pages it visited. It used regular expressions to search the index. Results from JumpStation and the Worm came out in the order that the search found them, meaning that the ordering of the results was essentially arbitrary. The RBSE spider was the first to improve on this process by implementing a ranking system based on relevance to the keyword string.(5)
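The difference is easier to see with a small example. The sketch below scores a few made-up page titles by how many query words they contain and sorts the best matches first; it is only an illustration of ranked retrieval in general, not the RBSE spider's actual algorithm.

```python
# Ranked retrieval in miniature: score each indexed title by how many
# query words it contains and return the best matches first. The titles
# and URLs are invented.
titles = {
    "http://a.example/": "Guide to the World Wide Web",
    "http://b.example/": "Web robots and spiders explained",
    "http://c.example/": "Recipes for banana bread",
}

def ranked_search(query):
    words = query.lower().split()
    scored = []
    for url, title in titles.items():
        score = sum(title.lower().count(w) for w in words)
        if score > 0:
            scored.append((score, url, title))
    return sorted(scored, reverse=True)   # highest score first

print(ranked_search("web spiders"))
```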
Their project was fully funded by mid-1993. Once funding was secured, they released a version of their search software for webmasters to use on their own web sites. At the time, the software was called Architext, but it now goes by the name of Excite for Web Servers.
The Galaxy went online in January 1994. It contained Gopher and Telnet search features in addition to the web-searching features. Interestingly enough, Gopher was vastly popular as a document-sharing tool when the web was born. The Gopher search capability was probably the primary reason for the creation of the EINet Galaxy. (There weren't really very many web pages to search through in January 1994!) The web page search capability was simply an additional feature.
To this day, Tradewave (www.tradewave.com) still clings to its directory-based roots; it uses no bots or spiders to seek out new URLs. Therefore, the Galaxy is a true directory in the sense that it lists only URLs that have been submitted to it, and all categorization and review of the submitted URLs is done by hand. This results in higher-quality pages and more relevant searches, but far fewer pages to search through.
As the number of links grew and their pages began to receive thousands of hits a day, the team created ways to better organize the data. In order to aid in data retrieval, Yahoo! (www.yahoo.com) became a searchable directory. The search feature was a simple database search engine. Because Yahoo! entries were entered and categorized manually, Yahoo! was not really classified as a search engine. Instead, it was generally considered to be a searchable directory. Yahoo! has since automated some aspects of the gathering and classification process, blurring the distinction between engine and directory.
The Wanderer captured only URLs, which made it difficult to find things that weren't explicitly described by their URL. Because URLs are rather cryptic to begin with, this didn't help the average user. Searching Yahoo! or the Galaxy was much more effective because they contained additional descriptive information about the indexed sites.
The history of WebCrawler is best told by those responsible:
"In early 1994, students and faculty in the Department
of Computer Science and Engineering [of the University of Washington]
gathered in an informal seminar to discuss the early popularity of
the Internet and the World-Wide Web. Students typically try out their
ideas in small projects in these seminars, and several interesting
projects were started. The WebCrawler was Brian Pinkerton's project,
and began as a small single-user application to find information on
the Web.
Fellow students persuaded Pinkerton to build the Web interface
to the WebCrawler that became widely usable. In that first release
on April 20, 1994, the WebCrawler's database contained documents
from just over 6000 different servers on the Web. The WebC rawler
quickly became an Internet favorite, receiving an average of 15,000
queries per day in October, 1994 when Pinkerton delivered a paper
describing the WebCrawler."
The most important point about WebCrawler is that it was the first full-text search engine on the Internet. Until its debut, a user could search through only URLs or descriptions, and the descriptions were sometimes created by the engines themselves or by reviewers trying to rate the sites.
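To see what full-text indexing buys you, here is a minimal sketch of an inverted index in Python: every word of every document points back to the pages that contain it, so a query can match text anywhere on a page rather than just its title or URL. The documents are invented placeholders.

```python
from collections import defaultdict

# A tiny inverted index over invented documents: map each word to the
# set of pages containing it.
docs = {
    "http://a.example/": "the webcrawler indexes the full text of pages",
    "http://b.example/": "titles and urls alone miss most of the text",
}

index = defaultdict(set)
for url, text in docs.items():
    for word in text.split():
        index[word].add(url)

def full_text_search(word):
    return index.get(word.lower(), set())

print(full_text_search("text"))   # both pages mention the word "text"
```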
A final word about WebCrawler from the company itself: "Several competitors emerged within a year of WebCrawler's debut: Lycos, Infoseek, and OpenText. They all improved on WebCrawler's basic functionality, though they did nothing revolutionary. WebCrawler's early success made their entry into the market easier, and legitimized businesses that today constitute a small industry in Web resource discovery." (www.webcrawler.com)
"Work on the Lycos spider began in May 1994, using John
Leavitt's LongLegs program as a starting point. (Lycos was named for
the wolf spider, Lycosidae lycosa, which catches its prey by pursuit,
rather than in a web.) In July 1994, I added the Pursuit retrieval
engine to allow user searching of the Lycos catalog (although Pursuit
was written from scratch for the Lycos project, it was based on experience
gained from the ARPA Tipster Text Program in dealing with retrieval
and text processing in very large text databases (9) ). On July 20,
1994, Lycos went public with a catalog of 54,000 documents. In addition
to providing ranked relevance retrieval, Lycos provided prefix matching
and word proximity bonuses. But Lycos' main difference was the sheer
size of its catalog: by August 1994, Lycos had identified 394,000
documents; by January 1995, the catalog had reached 1.5 million documents;
and by November 1996, Lycos had indexed over 60 million documents
-- more than any other Web search engine. In October 1994, Lycos ranked
first on Netscape's list of search engines by finding the most hits
on the word �surf.�"(6)
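Mauldin's mention of prefix matching and word proximity bonuses is worth unpacking. The sketch below shows, in Python, roughly what those two refinements mean: a query term matches any word that starts with it, and documents where the query terms land close together earn a higher score. It is only an illustration, not Lycos's implementation.

```python
# Rough illustration of prefix matching plus a word-proximity bonus.
def score(query_terms, document):
    words = document.lower().split()
    positions = []                 # first position where each term matches
    for term in query_terms:
        hits = [i for i, w in enumerate(words) if w.startswith(term.lower())]
        if not hits:
            return 0               # every term must match somewhere
        positions.append(hits[0])
    base = len(query_terms)
    span = max(positions) - min(positions) + 1
    return base + 1.0 / span       # tighter clusters earn a larger bonus

doc = "the wolf spider catches its prey by pursuit rather than in a web"
print(score(["wolf", "spid"], doc))   # prefix "spid" still matches "spider"
```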
Initially, Infoseek was just another search engine. It borrowed conceptually from Yahoo! and Lycos, not really innovating in any particular way. Yet the history of Infoseek and its current critical acclaim show that being the first or most original isn't always that important. Infoseek's user-friendly interface and the numerous additional services (such as UPS tracking, News, a directory, and the like) have garnered kudos, but it was Infoseek's strategic deal with Netscape in December 1995 that brought it to the forefront of the search engine line. Infoseek convinced Netscape (with the help of quite a bit of cash) to have its engine pop up as the default when people hit the Net Search button on the Netscape browser. Prior to this, Yahoo! was Netscape's default search service.
The rest of its features, all available from its introduction, changed the face of search engines forever. AltaVista was the first to use natural language queries, meaning a user could type in a sentence like "What is the weather like in Tokyo?" and not get a million pages containing the word "What." It was also the first to implement advanced searching techniques, such as the use of Boolean operators (AND, OR, NOT, etc.). Furthermore, a user could search newsgroup articles and retrieve them via the web, as well as specifically search for text in image names, titles, Java applets, and ActiveX objects. AltaVista also claims to be the first search engine to allow users to add and delete their own URLs from the index, placing them online within 24 hours.
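Boolean operators map neatly onto set operations, which is a handy way to see why they narrow a search so effectively. The short Python sketch below evaluates a few Boolean queries over a tiny, made-up index of which pages contain which words.

```python
# Boolean searching in miniature: AND, OR, and NOT correspond to set
# intersection, union, and difference. The index entries are invented.
pages_with = {
    "weather": {"page1", "page2", "page3"},
    "tokyo":   {"page2", "page4"},
    "london":  {"page3"},
}

# "weather AND tokyo"
print(pages_with["weather"] & pages_with["tokyo"])     # {'page2'}
# "weather OR london"
print(pages_with["weather"] | pages_with["london"])    # three pages
# "weather AND NOT tokyo"
print(pages_with["weather"] - pages_with["tokyo"])     # {'page1', 'page3'}
```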
One of the most interesting new features AltaVista provided was the ability to search for all of the sites that link to a particular URL. This was very useful for web designers who were trying to get some popularity for their pages; they could frequently check to see how many other pages were referencing them.
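Conceptually, that link search only requires recording every link in reverse while crawling, from the target page back to the page it appeared on. Here is a minimal sketch of that bookkeeping in Python; the link data is invented.

```python
from collections import defaultdict

# Invert the link graph: for each link seen while crawling, record the
# target page and the page it appeared on (source). All URLs are invented.
links_found = [
    ("http://a.example/", "http://popular.example/"),   # (source, target)
    ("http://b.example/", "http://popular.example/"),
    ("http://a.example/", "http://c.example/"),
]

linked_from = defaultdict(set)
for source, target in links_found:
    linked_from[target].add(source)

print(linked_from["http://popular.example/"])   # pages referencing this URL
```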
On the user interface end, AltaVista made a number of innovations. It put "tips" below the search field to help the user better formulate a search. These tips change constantly, so after using the search a few times, users see a number of interesting features that they might not otherwise have known about. This system became widely adopted by the other search engines.
In 1997, AltaVista created LiveTopics, a graphical representation system to help users sort through the thousands of results that a typical AltaVista search generates. LiveTopics is interesting as a search tool, but conceptually it is more confusing than the standard search format. Although its innovative qualities are uncontested, its effectiveness remains to be seen (altavista.software.digital.com/search/showcase/two/index.htm).
The Inktomi search engine was quickly licensed to Wired magazine's web site, HotWired. This site's popularity accounted for much of the initial fervor over HotBot. Wired's reputation as the oracle of the Net made promoting the site fairly straightforward.
So what's the big deal? Just another search engine? Well, yes and no. HotBot is probably the most powerful of the search engines, with a spider that can supposedly index 10 million pages per day. According to the Wired web site, HotBot should soon be able to reindex its entire database on a daily basis. This will ensure that the pages returned from a search are not out of date, which is now common with other search engines.
Additionally, HotBot makes extensive use of cookie technology to store personal search preference information. A cookie is a small file that a site can store on your computer. This file can be read only by the site that generates it. It can hold a small amount of text or binary information. This information is often used by sites to store customization information or to store user demographic data.
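In practice, a cookie travels in an HTTP header. The Python sketch below uses the standard http.cookies module to build the Set-Cookie header a site might send to remember a search preference and to read it back on a later visit; the preference name and value are invented.

```python
from http.cookies import SimpleCookie

# Build a Set-Cookie header a site might send to remember a (made-up)
# search preference.
cookie = SimpleCookie()
cookie["results_per_page"] = "25"
cookie["results_per_page"]["max-age"] = 30 * 24 * 3600   # keep for 30 days
print(cookie.output())   # Set-Cookie: results_per_page=25; Max-Age=2592000

# Read the value back when the browser returns it on a later visit.
returned = SimpleCookie()
returned.load("results_per_page=25")
print(returned["results_per_page"].value)   # '25'
```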
HotBot recently won the PC Computing Search Engine Challenge, a contest between the major search engines. Representatives from each company were asked questions that could be answered only by a web search, and the engine that most effectively led its representative to the right answer won the question. Although this challenge proved little beyond the searching abilities of the various representatives, it still garnered quite a bit of critical acclaim for HotBot, further increasing its popularity.
The current solution to this problem is the META engine. META engines forward search queries to all of the major web engines at once. The first of these engines was MetaCrawler. MetaCrawler searches Lycos, AltaVista, Yahoo!, Excite, WebCrawler, and Infoseek simultaneously.
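The underlying idea is simple: send the same query to several engines at once and gather whatever comes back. The Python sketch below shows that fan-out in miniature; the engine URLs and query-string formats are placeholders rather than the real services' interfaces, and an actual meta-engine would also parse and merge the result pages.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder engines: the URLs and query formats below are invented.
ENGINES = {
    "engine-a": "http://search-a.example/find?q=",
    "engine-b": "http://search-b.example/search?query=",
}

def ask(engine_name, query):
    """Send the query to one engine and return its raw response (or None)."""
    url = ENGINES[engine_name] + quote(query)
    try:
        return engine_name, urlopen(url, timeout=10).read()
    except OSError:
        return engine_name, None          # engine unreachable or too slow

def metasearch(query):
    # Query every engine in parallel and collect the responses.
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        return dict(pool.map(lambda name: ask(name, query), ENGINES))

results = metasearch("wolf spider")
```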
MetaCrawler was developed in 1995 by Erik Selberg, a master's student at the University of Washington (the same place where WebCrawler was developed a year earlier). Like WebCrawler, MetaCrawler soon grew too large for its university britches and had to be moved to another site. Here, Selberg tells the story of how MetaCrawler became the go2net search engine:
MetaCrawler was conceived in spring of 1995 by myself and my advisor, Oren Etzioni, as my master's degree project. It grew rapidly in popularity once we released it publicly, gaining many new users after Forbes mentioned us in a cover-page article. Use jumped after C|Net reviewed all the major search services, ranking us No. 1, with AltaVista No. 2 and Yahoo No. 3...

In May of 1996, I (along with most of the rest of the AI department at UW) created NETbot. ...When I left NETbot to return to research at UW, MetaCrawler was now under 7 x 24 monitoring service, the code was as reliable as ever, and we had made several performance improvements. ...

There was a realization that Netbot was ill-equipped to handle negotiations with the search services for continued MetaCrawler use. Thus, the decision was made to license MetaCrawler to go2net, who could provide the resources necessary to make MetaCrawler viable as well as negotiate with the search services toward mutually beneficial arrangements. (www.metacrawler.com/selberg-history.html)
MetaCrawler functions by reformatting the output from the various engines it queries onto one concise page. Throughout MetaCrawler's history, the search engine companies it worked with did not entirely approve of this procedure. The most common complaint was that the advertising banners those engines ran on their own sites did not appear when a user searched through MetaCrawler. This meant that their ads were not reaching the intended audience, reducing their ad revenues.
The move to go2net heralded MetaCrawler's concession to these concerns. Now MetaCrawler displays the ads from each search site right above the results. MetaCrawler users were not thrilled by this change because it increased the time it took for the result page to download. However, skillful design of the result pages now causes the text to load first, calming restless users.
Soon, another reason for having a "private" search engine became apparent. Unlike most other media, a web page is constantly updated, and new pages are added to and removed from sites every day. None of the major web-based search engines could search the entire web on a daily basis. Therefore, the search databases would often contain out-of-date references or would miss entire sections of web sites. The larger sites began indexing their own sites and providing search engines that would primarily search through their own materials. Some allowed the user to search the rest of the web as well by linking the engine into one of the larger web databases such as AltaVista.
Many relatively small sites are now providing search engines for their own sites. This is because search engines are becoming easier and easier to use and incorporate within a web site, and because the rapid growth of the web has led to an incredible amount of "junk" in the form of out-of-date pages, pages with misleading descriptions, pages deliberately designed to confuse search engines, and so on. Additionally, it is often difficult to know what to search for, and many users have a hard time expressing what it is they wish to find in a language that search engines can effectively understand. Using a site-specific search engine narrows the possibilities enough that a poorly formulated search may still return the intended result.
Now that we've finished our search engine history lesson, you should be somewhat familiar with a number of the key players in the search engine area. Additionally, you should be starting to get a feeling for some of the issues that search engines face.
The next chapter takes a closer look at some of the engines mentioned here as well as a few others. You'll learn how users interact with each engine. Ultimately, you'll understand the strengths and limitations of today's search techniques and what users have come to expect from a search engine. This knowledge is extremely important when choosing a search engine for your own web site. It will help you determine if a particular engine can handle the task you need it to accomplish. You'll also be able to better understand how your users will interact with the engine you choose.
2. This fact did not thrill MIT network administrators when the web became popular a year later. Although they made an attempt to wrestle the URL away from SIPB, the students prevailed, and to this day MIT's own homepage is located at http://web.mit.edu. There is an interesting allegory relating to this at the bottom of SIPB's main page at http://www.mit.edu for those that are curious.
3. Such as the document "Inessential Refrigerator Restocking," which is still available at http://www.mit.edu:8001/sipb/documents/
4. Michael Mauldin, "Lycos: Design Choices in an Internet Search Service," 1997.
5. The name Veronica officially expands to Very Easy Rodent-Oriented Netwide Index to Computerized Archives -- somehow I think they worked the expansion out afterwards, but you decide.
6. Michael Mauldin, "Lycos: Design Choices in an Internet Search Service," 1997.
Copyright © 1997 Wes Sonnenreich. All Rights Reserved.