
Searching the Web for Information: How do we Fare?

Gillian Westera

Curtin University of Technology
GPO Box U1987
Perth, Western Australia 6001

gillian@boris.curtin.edu.au


Abstract
The World Wide Web is an incredibly fast-growing phenomenon and with this growth comes an overwhelming amount of information which is not locatable in any structured manner. Organisations and individuals have created their own search engines and it is these which make the Web more accessible. This paper studies some of the better-known search engines available in an attempt to discover the best for users in Australia. Their accessibility, search capabilities, interface and the resultant list of matched documents are examined.
Keywords
search engines comparison, robots, InfoSeek, Lycos, World Wide Web Worm, WebCrawler

Introduction

The World Wide Web has changed the face of the Internet within the last year. Anyone can create information and make it available for all to see on the Web. There are no controls in place to stop an individual from becoming a publisher. This glut of information therefore needs to be accessed in a manner which is simple for the everyday user. As an information services librarian I am interested in the workings of search engines from a user point of view. I am therefore not looking into the technicalities behind the engines discussed in this paper (leaving that to the professionals). I want to know what is best for the user, taking into consideration the search requirements and resultant hit lists.

Due to the large amount of information available it is quite difficult for users to find what they are looking for. Because of this, enterprising people and organisations have created methods of finding information more quickly. There are three main methods people use, all of which have their merits:

a) Surfing

Surfing involves serendipity. It is an excellent method of locating information which a user may not have considered available on the Web. It involves starting somewhere and just following the links and is the method most people seem to use when they first begin using the World Wide Web. This provides novices with an idea of what sort of information is available but is not a reliable method to find a particular piece of information in Webspace.

b) Subject trees/lists

These are lists of lists guiding users to useful resources on a particular subject or of a particular type. The user must select from various lists which guide him/her to possibly relevant documents. An excellent example of this is Yahoo (Yet Another Hierarchical Officious Oracle) which allows the user to search through its hierarchy and includes some useful search options including narrowing by finding results only in the title or URL, using Boolean AND or OR, and offering word stemming. While the documents found can be extremely useful, the information is only searched for by site, not by content. Yahoo also offers a search engine to access these lists more easily. While subject trees/lists are sometimes limited in the documents available, they can offer quality resources and may often be better than straight keyword/phrase searching. This is because documents of a similar subject/type may be grouped together.

c) Keyword/phrase searching via robots

Search engines which use robots to automatically locate information across Webspace offer varying levels of searching capabilities. The results appear in hypertext and can immediately be selected to link to the required documents (as for subject trees). The main difference is that they tend to encompass much more of what's available on the Web. Robots (also known as spiders, wanderers and worms) attempt to index defined parts or all of Webspace, and Martijn Koster has put together an excellent list of these [KOS95-1]. He has also co-written 'Guidelines for Robot Writers' [KOS95-2] which many robot creators follow.

All three methods of locating information on the Web are valid and the user will find that one method is more appropriate than another depending on the type of information required. The rest of this paper discusses keyword searching via robots and compares some of the more popular and easily accessible search engines available in an attempt to find the best available for the general Web user.

Search Engines Using Robots

There is quite a variety of robots available on the Web. The following search engines were used in my tests (results will be discussed later): Harvest, InfoSeek, JumpStation II, Lycos, OpenText, RBSE, Wandex (World Wide Web Wanderer), WebCrawler, and WWWW (World Wide Web Worm). Of these, only four retrieved most/all of the documents searched for. These are looked at more closely:

InfoSeek (http://www2.infoseek.com/)

(Free search engine section only considered)

General information: InfoSeek is a commercial provider which also offers free access. This free access is supposed to be limited, though I found the search speed and result lists excellent (though results are limited to the first ten). InfoSeek's free search engine 'serves more than 500,000 queries per day' [INF95]. 'On June 3, 1995, InfoSeek had indexed 20% more data than Lycos (1.5 Gbytes of data). No WWW page was older than 7 days. All data was freshly loaded' [INF95]. Note that this information is used as publicity by InfoSeek. InfoSeek complies with the 'Standard for Robot Exclusion' developed by Martijn Koster et al. [KOS95-3]. It searches full text but only indexes World Wide Web sites. It plans to keep Web site links for two years.

Updating of information: InfoSeek is completely updated every 48 hours and locates Web sites world-wide. It is possible for users to submit URLs to be included in this index.

User interface: The search interface is very basic, offering a query box and clickable buttons 'Run Query' and 'Clear Query Text'. There is an option to look at 'helpful tips' which provides advanced searching capabilities which are quite sophisticated. InfoSeek doesn't support truncation or obvious Boolean logic though the Boolean AND is implied in a user's search. An extremely comprehensive FAQ is available which offers comparisons between Boolean logic and InfoSeek's syntax [INF95]. Since terms are stemmed, automatic truncation occurs when searching. It is not possible to ask for an exact match. Exclusion of unwanted terms is possible (i.e. implied Boolean 'NOT'). Proper nouns can be searched for (i.e. 'Gates' rather than 'gates') to gain a more precise result.
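The combination of implied AND, term exclusion and stemming described above can be illustrated with a small sketch. This is not InfoSeek's implementation: the suffix-stripping rule in `stem` and the leading `-` used to mark an excluded term are assumptions made only for illustration, and the sketch ignores the case-sensitive handling of proper nouns.

```python
def stem(word: str) -> str:
    """Crude stand-in for a stemmer: lower-case and strip a common suffix."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, document: str) -> bool:
    """Every plain term must occur (implied AND); '-term' must not occur."""
    doc_stems = {stem(word) for word in document.split()}
    for term in query.split():
        if term.startswith("-"):
            # Hypothetical exclusion syntax: reject documents with this term.
            if stem(term[1:]) in doc_stems:
                return False
        elif stem(term) not in doc_stems:
            return False
    return True


document = "Common birds of the botanic gardens"
print(matches("bird garden", document))   # stems of both terms are present
print(matches("bird -garden", document))  # excluded term is present
```

Because both query and document words are reduced to stems, a search for 'bird' also finds 'birds', which is the automatic truncation effect noted above.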

The search speed is consistently the fastest I have come across out of the various search engines I have used. The resultant hit list offers the title of the link in hypertext and includes keywords/phrases in context. The URL is also listed along with the size of the document. The resultant list for the free search engine is up to 10 hits long.

Lycos (http://www.lycos.com/)

General information: Developed by Dr. Michael L. Mauldin, Lycos is available at Carnegie Mellon University. It complies with the 'Standard for Robot Exclusion' developed by Martijn Koster [KOS95-3] and identifies itself at each site it visits.
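For readers unfamiliar with the exclusion standard, the following is a minimal sketch of how a compliant robot might check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the robot name 'ExampleRobot' and the example.org URLs are all hypothetical, and nothing here describes how Lycos itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file; a real robot would fetch /robots.txt from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the exclusion rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# 'ExampleRobot' is the name the robot would announce to each site it visits.
allowed = may_fetch(ROBOTS_TXT, "ExampleRobot", "http://www.example.org/index.html")
blocked = may_fetch(ROBOTS_TXT, "ExampleRobot", "http://www.example.org/private/notes.html")
print(allowed, blocked)
```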

At present Lycos doesn't offer Boolean searching or adjacency searching though this is being considered for the future [WEI95]. Lycos is case insensitive. There are two different databases which can be selected to search through; the smaller one offers a significantly smaller index but is more readily accessible.

Updating of information: 'The Lycos web explorer searches the World Wide Web every day (including Gopher and FTP space), building a database of all the web pages it finds. The index is updated weekly' [CAR95]. It does not index 'ephemeral or changing data or infinite virtual spaces...the following are not considered part of the Web: WAIS databases, USENET news, Telnet services and Email' [MAU95]. It is possible for a user to submit a URL to be included in Lycos. While the information is updated, older links appear to remain [STA95] which leads to a false result list. As of August 2nd, Lycos contains 5,628,298 unique URLs.

User interface: The user interface is very user friendly. There is a link to 'search language help' which gives hints on how to type in terms. It is possible to limit the number of hits to a particular number and to request that a minimum number of the terms typed in must appear in the result. Users can also select the minimum relevancy score though this may not be straightforward for the novice user. It is also not always the best method of ranking results since it assumes that if a term is mentioned often, the document is more relevant. The results can be viewed in a brief format or with added information. The search engine uses automatic truncation and it is also possible to force an exact match. Unwanted terms can easily be excluded.
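The weakness of frequency-based relevancy can be shown with a small sketch. The scoring function below simply counts how often the query terms occur. It is an assumption chosen for illustration and not the formula Lycos uses, and the three page texts are invented.

```python
from collections import Counter


def frequency_score(query: str, text: str) -> int:
    """Score a document by the total number of query-term occurrences."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def rank(query: str, documents: dict) -> list:
    """Return (score, name) pairs for matching documents, best first."""
    scored = [(frequency_score(query, text), name) for name, text in documents.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Invented page texts, for illustration only.
documents = {
    "treaty-page": "the treaty of waitangi full text of the treaty",
    "souvenir-shop": "waitangi waitangi waitangi souvenirs waitangi gifts",
    "frog-kit": "virtual frog dissection kit",
}
ranking = rank("waitangi treaty", documents)
print(ranking)
```

In this made-up example the page that merely repeats one term is ranked above the page that is actually about the treaty, which is the problem described above.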

The resultant list providing full information supplies the results in relevancy order; provides each result with a relevancy score; includes the URL which is in hypertext and so can directly link the user to the wanted information; lists the last time it was fetched, its size in bytes and how many links the document has; and provides the title of the document, an outline, and an excerpt from the document which includes when it was last updated. This excerpt is very useful in deciding whether a particular document is relevant or not and can eliminate the need to connect to the source to check further.

WebCrawler (http://webcrawler.com/)

General information: Developed by Brian Pinkerton at the University of Washington in 1994, WebCrawler is now owned by America Online, Inc. There are two methods of searching: via the general menu or using a simpler mode which doesn't offer Boolean searching. WebCrawler doesn't follow Koster's Standard for Robot Exclusion [KOS95-3]. The search engine does full text searching and when retrieving terms, strips them down and puts them all in lower case (i.e. searching is case insensitive). URL, title and content information are all indexed. It has a list of stopwords which are not searched on (e.g. WWW). As of June 1st, 1995 'more than 250,000 users a week search the WebCrawler's index of 29,000 Internet sites world-wide. WebCrawler indexes more than 2000 new sites monthly' [AME95]. WebCrawler holds information on over 1.5 million different documents [WEB95].
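The term handling described above (stripping, lower-casing and dropping stopwords) can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is invented apart from 'WWW', the punctuation stripped is a guess, the page URLs are hypothetical, and WebCrawler's real indexer is not documented here.

```python
from collections import defaultdict

# Only 'WWW' comes from the paper; the other stopwords are assumptions.
STOPWORDS = {"www", "the", "of", "a", "and"}


def index_terms(text: str) -> list:
    """Strip punctuation, lower-case each word and drop stopwords."""
    terms = []
    for raw in text.split():
        term = raw.strip(".,;:()'\"").lower()
        if term and term not in STOPWORDS:
            terms.append(term)
    return terms


def build_index(pages: dict) -> dict:
    """Map each term to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in index_terms(text):
            index[term].add(url)
    return index


# Hypothetical pages used only to exercise the functions.
pages = {
    "http://www.example.org/a": "The Telerobot WWW page",
    "http://www.example.org/b": "telerobot handbook",
}
index = build_index(pages)
print(sorted(index["telerobot"]))
```

Since every term is stored in lower case, a query for 'Telerobot' or 'telerobot' reaches the same index entry.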

Updating of information: WebCrawler usually has an indexing run every week. Individuals can submit URLs to be added to WebCrawler. There is a 'Web Top 25 List' which lists the most frequently accessed sites found by WebCrawler. This is continually updated.

User interface: The user interface is straightforward. The query box must be clicked on to type in terms. There is an 'AND words together' box which can be selected, offering basic Boolean searching. It is possible to limit the number of hits in the result list. The search page guides the user to a 'Searching Hints' page as well as a FAQ. The resultant list only provides one line of information and doesn't include the URL (which I believe is essential). The results are hypertext-linked to the site. The results are ranked using a relevancy guide with 1000 being the most relevant (similar to Lycos).

WWWW (World Wide Web Worm) (http://www.cs.colorado.edu/home/mcbryan/WWWW.html)

General information: WWWW was developed by Oliver McBryan and 'provides four types of search databases: citation hypertext, citation addresses (URL), HTML titles and HTML addresses. The latter two are much smaller databases, which can therefore be searched faster' [MCB95?]. It doesn't index the contents of documents, only 'URLs and hyperlinks containing URLs, many of which aren't links to documents' [PFA95, p.12]. The index is case insensitive. It 'serves 3,000,000 URLs to 2,000,000 folks/month' [MCB94]. It follows Koster's robot guidelines.

Updating of information: Users can submit home pages to be included in WWWW. I could find no information on how often it is updated.

User interface: The search interface is user friendly offering pull down menu options and a query box to input keywords. The first pull down menu allows the user to search all URL references, all URL addresses, just document titles, or just document addresses. The user can also select to AND or OR all keywords (AND is the default), and can limit the number of search results. Help is available by selecting from Instructions, Definitions, Examples or Failures (all lead to the same document). It offers advanced searching for those who are familiar with 'regular expression syntax as used in the UNIX egrep program' [MCB95?] (i.e. not for the general user).
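To give a feel for the AND/OR option and the regular expression searching, here is a minimal sketch of matching keywords against a list of document titles. It uses Python's `re` module, whose syntax is similar but not identical to egrep's. The function and the titles are illustrative assumptions and not WWWW's own code.

```python
import re


def search_titles(titles, keywords, mode="AND"):
    """Return titles matching all (AND) or any (OR) of the keyword patterns."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    # Case-insensitive matching, since the WWWW index is case insensitive.
    patterns = [re.compile(keyword, re.IGNORECASE) for keyword in keywords]
    combine = all if mode == "AND" else any
    return [t for t in titles if combine(p.search(t) for p in patterns)]


# Illustrative titles only.
titles = [
    "Curtin Online Handbook",
    "Curtin University Library",
    "Virtual Frog Dissection Kit",
]
print(search_titles(titles, ["curtin", "handbook"]))        # AND is the default
print(search_titles(titles, ["curtin", "handbook"], "OR"))
print(search_titles(titles, ["frogs?|toads?"]))             # regular expression
```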

The resultant list shows which keywords were searched for and lists the results by name of document and URL with both being in hypertext. At the bottom of the list the user is advised if the WWWW found more matches than he/she chose to view. If a resultant list is empty the user is guided to help documents to assist. Results may include images and sound.

Tests of Search Engine Capabilities

To test the capabilities of the search engines decided on, I chose eight documents which I know to exist and then tested to see if the search engines could locate these documents. By choosing known documents I was able to select documents from around the world to see geographically if the documents were easily located by the various search engines. I was therefore interested in finding out whether the search engine could find the site and not whether it could find information on a particular topic, though it is often useful to discover other interesting documents which appear in result lists.

I only considered the first five documents listed in each result list since the general searcher is not really interested in looking through a huge list to locate the desired information. Ideally, the 'hit' should appear within the first five results (though very general search terms will always provide a very general result list). I was also interested in finding out if the links were direct (i.e. the exact document was listed with a link in the result list) or indirect (more than one leap and less than four leaps away from the exact match).
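The scoring rule just described can be written down as a short sketch. The function name, the `leaps` table and the example.org URLs are hypothetical. The default bounds follow the wording above literally (two or three leaps count as indirect), and a reader who prefers to count a single leap as indirect can pass `min_leaps=1`.

```python
TOP_N = 5  # only the first five results of each list are considered


def classify(results, leaps, min_leaps=2, max_leaps=3):
    """Classify a result list as 'direct', 'indirect' or 'miss'.

    results: ranked list of URLs returned by a search engine.
    leaps:   known number of link leaps from a URL to the target document
             (0 means the URL is the target itself).
    """
    outcome = "miss"
    for url in results[:TOP_N]:
        distance = leaps.get(url)
        if distance is None:
            continue  # unrelated result
        if distance == 0:
            return "direct"
        if min_leaps <= distance <= max_leaps:
            outcome = "indirect"
    return outcome


# Hypothetical data: the target is the Telerobot page used in the tests.
target = "http://telerobot.mech.uwa.edu.au/"
leaps = {
    target: 0,
    "http://www.example.org/uwa-links": 2,
    "http://www.example.org/one-leap": 1,
}
print(classify(["http://www.example.org/other", target], leaps))
print(classify(["http://www.example.org/uwa-links"], leaps))
```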

The following Web sites/resources were chosen based on where their sites are geographically:

The WebMuseum at http://mistral.enst.fr/~pioch/wm/net/
The Virtual Frog Dissection Kit at http://george.lbl.gov/ITG.hm.pg.docs/dissect/dissect.html
InfoMap at http://www.sg/
The Treaty of Waitangi at http://www.govt.nz/tow/
Commonwealth Budget Papers at http://www.nla.gov.au/finance/budget95/budget95.html
Common Birds of the Australian National Botanic Gardens at http://155.187.10.12/anbg/birds.html
The Telerobot at the University of Western Australia at http://telerobot.mech.uwa.edu.au/
The Curtin Online Handbook (Curtin University) at http://www.curtin.edu.au/curtin/handbook/

While accessing the various search engines I tried to use the same terms to give the searches some consistency. Where a search engine offered additional search capabilities (such as Boolean searching, wild card searching and limiting the number of hits) I chose to take advantage of these since they exist to enhance the capabilities of the search engine. The following terms/phrases (or part of) were used:

	I     webmuseum
	II    frog dissection kit
	III   infomap
	IV    waitangi treaty
	V     budget papers australia
	VI    birds botanic gardens australia
	VII   telerobot university western australia
	VIII  curtin handbook

Time taken to retrieve results was taken into consideration and it was found that if accessed early in the morning all of the search engines were quite fast (generally less than a minute to bring up a set of results). Late in the day (generally midday onwards West Australian time) it was difficult to gain access to most of these search engines at all. The one exception was InfoSeek which I found to be quite fast most times during the day.

The Netscape browser (version 1.1N) on a Macintosh computer was used for all tests.

Resultant Summary of Tests

It appears that the commercial search engines are becoming very good. Lycos, which I have always found excellent, came second to InfoSeek which has the added advantage of being easily accessible most times during the day. Lycos would benefit from offering a more advanced query language to compete more successfully with InfoSeek. WebCrawler has improved significantly over the past two months since I began my tests. It rated closely behind Lycos, also locating all documents. WWWW, OpenText and Harvest did not locate all documents (all non-Australasian documents were found). The other search engines did not rate well at all and I wouldn't recommend using these in their present states.

I had chosen sites which were geographically significant and my results illustrate that the search engines are actively collecting world-wide and not just in their locality. It is also interesting to note that most of the indirect hits I found were for sites within Australasia. Does this imply that the robots don't 'roam' as widely as they should?

While search engines which use robots offer an excellent method of locating useful information on the Web, it is obvious that it is difficult for everything to be indexed. And even with what is indexed, it is very easy to retrieve irrelevant documents. I would like to see more agreement between the different search engine creators to create something which combines all the best qualities available in each of the robots. SavvySearch does take advantage of some of the search engines available by allowing users to search through more than one search engine at a time which is an excellent service.
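One simple way to combine result lists from several search engines is to interleave them and drop duplicates. The sketch below shows that idea only. It is an assumption for illustration and does not describe how SavvySearch actually works, and the URLs are placeholders.

```python
from itertools import zip_longest


def merge_results(*result_lists, limit=10):
    """Interleave ranked URL lists from several engines, skipping duplicates."""
    merged, seen = [], set()
    # Take the first result of every engine, then the second, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


# Placeholder result lists from two hypothetical engines.
engine_a = ["http://www.example.org/1", "http://www.example.org/2", "http://www.example.org/3"]
engine_b = ["http://www.example.org/2", "http://www.example.org/4"]
combined = merge_results(engine_a, engine_b)
print(combined)
```

Interleaving keeps each engine's top result near the top of the combined list, at the cost of ignoring the relevancy scores, which are not comparable between engines.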

There are marked differences between the different search engines in how a query may be entered. Some of these search engines would benefit from more straightforward and sophisticated query operators to offer searchers the opportunity for a better result. Lycos offers excellent result list content which some of the other search engines would benefit from offering (particularly search engines such as WebCrawler which only offers the name of the document). By including the extra information, the searcher can often decide whether it is worth accessing the document or not.

Conclusion

Locating information is often quite difficult if search terms are broad. My research for this paper illustrates this. It was almost impossible to locate information about the search engines I have described since my keywords would only bring up the search engines themselves. I found that by finding home pages of key people and using Yahoo via the subject list (and not the search engine) I was guided to excellent information. We may just have to concede that robots will never be capable of indexing all of Webspace and that the subject tree approach is an excellent alternative. As Martijn Koster says in his article 'Robots in the Web: threat or treat?', robots 'will become less effective and more problematic as the Web grows' [KOS95-4] and with the rate it is growing at this must be happening already. It is obvious that using the different methods mentioned in tandem is the best way to find information in Webspace. Let's hope that Australian documents will be more thoroughly indexed by future robots.


Bibliography

[BER94]
Berners-Lee, T., Cailliau, R., Luotonen, A., Nielsen, H.F., and Secret, A.: The World-Wide Web. Communications of the ACM, 37(8), pp.76-82.
[DEC94]
December, J., & Randall, N.: The World Wide Web unleashed. Indianapolis, U.S., Sams Publishing.
[PFA95]
Pfaffenberger, B.: World Wide Web bible. New York, MIS Press.
Hypertext References:
[KOS95-1]
Koster, M.: World Wide Web Robots, Wanderers, and Spiders. http://web.nexor.co.uk/mak/doc/robots/robots.html
[KOS95-2]
Koster, M., Fletcher, J., McLoughlin, L. and others: Guidelines for robot writers. http://web.nexor.co.uk/mak/doc/robots/guidelines.html
[INF95]
InfoSeek Corp. InfoSeek FAQ. http://www.infoseek.com/doc/FAQ/
[KOS95-3]
Koster, M.: A standard for robot exclusion. http://web.nexor.co.uk/mak/doc/robots/norobots.html
[WEI95]
Weiss, A.: Hop, skip, and jump: Navigating the World-Wide Web. http://www.mecklerweb.com:80/mags/iw/v6n4/feat41.htm
[CAR95]
Carnegie Mellon University: Lycos: frequently asked questions. http://lycos.cs.cmu.edu/lycos-faq.html
[MAU95]
Mauldin, M.L.: Measuring the Web with Lycos. http://lycos.cs.cmu.edu/lycos-websize.html
[STA95]
Stanley, Tracey: Searching the World Wide Web With Lycos and InfoSeek. http://www.leeds.ac.uk/ucs/docs/fur14/fur14.html
[AME95]
America OnLine, Inc.: Press release, June 1st. http://webcrawler.com/AOL/Press/PR.060195.txt
[WEB95]
WebCrawler FAQ: http://webcrawler.cs.washington.edu/WebCrawler/FAQ.html
[MCB95?]
McBryan, O.: Instructions. http://www.cs.colorado.edu/home/mcbryan/WWWWintro.html
[MCB94]
McBryan, Oliver: GENVL and WWWW tools for taming the Web. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
[KOS95-4]
Koster, Martijn: Robots in the Web: threat or treat? http://web.nexor.co.uk/users/mak/doc/robots/threat-or-treat.html

COPYRIGHT © 1995 by AUUG95 and APWWW95 Charles Sturt University. ALL RIGHTS RESERVED. ISBN 1 875781 43 9