Last Modified: Friday, 21-May-2004 09:47:03 EST
From Honeypots to a Web of SIN - Building the World-Wide Information System
David G. Green
Johnstone Centre, Charles Sturt University,
PO Box 789 Albury New South Wales 2640 AUSTRALIA
dgreen@csu.edu.au
/dgg.html
- Abstract
- In the World-Wide Web we see the beginnings of a truly global information system. However several needs stand out in the information resources currently available. Perhaps the most obvious is to make it easier for users to locate the information they want. The needs for greater standardization, quality control and stability are equally acute. The prevalent "honeypot" model, in which sites compete to attract users, does not address these needs at all. The "special interest network" (SIN) model attempts to create communities of users and suppliers who cooperate to form complete information environments on particular topics. The main functions are communication, publication, virtual libraries and on-line services (including commerce). Experience shows that the SIN model encourages participation and accommodates growth.
- Keywords
- Australia Index, coordination, environment, honeypot, organization, quality assurance, special interest networks (SIN), stability, standards
Introduction
The World Wide Web is at a critical stage of development. On one hand the Web is beginning to change the very ways in which we do things. In many professional areas, such as business, research and teaching, we are seeing the first signs of new paradigms, new practices and new institutions. On the other hand the Web's fantastic growth has gone on with virtually no coordination of its content.
At present the provision of on-line information is dominated by a very competitive environment, which discourages collaboration and inhibits cooperation. The quest of Web sites for attention has favoured vacuous style over substance and encouraged massive duplication of effort. As a result the development of services has been very patchy. There are vast tracts of information about some subjects; almost nothing about others.
And yet models for cooperation do exist. There is a growing trend in many areas of on-line activity towards large scale projects that involve contributions from many sources. Perhaps the best example is the growth of molecular biology databases. International databases, such as Genbank [Bilofsky88] and EMBL [Cameron88], are public compilations consisting of contributions from thousands of scientists. Attempts are now underway to expand this practice into other areas, such as biodiversity (e.g. [Burdet92], [Canhos92], [Green94]).
There are good reasons for seeking to promote cooperative models for Web activity. Existing projects, notably the WWW Virtual Library and the Virtual Tourist, show what can be achieved through cooperative development of information resources. I argue that cooperative approaches are essential if the Web is to achieve its full potential. With this in mind I here propose a general model for large-scale collaboration on the World Wide Web and provide some examples of its implementation.
The honeypot effect
If you want to sell something, then people have to see it. This age-old law of the marketplace still holds true on the Web. And the best way to get people to see what you have to offer is to attract their attention by giving away something interesting, free or useful (preferably all three). This effect is much like bees being attracted to a "honeypot". Once people find a site that provides something they want they almost invariably explore that site to find out what else it has to offer. The point is that attracting an audience makes it possible to "sell" ideas, products or services very effectively. Thus the Web blurs the line between publishing and advertising: a successful publication is the best form of advertising.
The Web at present is dominated by this honeypot effect. O'Reilly's "Global Network Navigator" (GNN) is perhaps the best known commercial example, but all successful sites exploit it in one way or another. Many of the innovations appearing on the Web arise from the honeypot effect. "Cybermalls", for instance, are really extensions of the way that similar businesses tend to congregate together in the real world.
Web managers need to be aware of the honeypot effect. For instance, many sites sell themselves short because they are too inward-looking. A network presence is an effective method of promotion, but simply setting up a Web server is not enough. The most common mistake that sites make is to hide really interesting and useful information behind layers of details about their internal organization. At the other extreme some sites mistakenly think that flashy design can make up for lack of substance.
However the honeypot effect has many disadvantages. It encourages wasteful duplication of effort. In their desire to attract users, for instance, virtually every Australian commercial server tries to duplicate Charles Sturt University's Register of Australian Web servers. Worst of all the honeypot promotes an extremely competitive atmosphere that has so far prevented the Web community from addressing many pressing needs.
Needs in network publishing
As the volume and variety of network information grows, several needs are increasingly evident. These include:
- Organization
- At present the Web is not well organized. The most common complaint by users is how difficult it is to find what they want. Virtual libraries, such as the WWW Subject Index and Geographic Index, and massive search engines, such as Lycos and Yahoo, do not make up for the lack of coordination between the sources of information.
- Stability
- The most frustrating problem, for users and managers alike, is that important sources of information frequently go "stale". Present mechanisms for reporting URL changes are simply not effective. However the solution is not to concentrate information at a single centre. An important principle is that the site that maintains a piece of information should be the principal source. Copies of (say) a dataset can become out-of-date very quickly, so it is more efficient for other sites to make links to the site that maintains a dataset, rather than take copies of it.
- Quality Control
- We all want information that is valid, up-to-date and accurate, and software that works correctly. These needs are even more acute in professional applications where errors could have major financial or legal implications. At present the only real form of quality assurance is the professional reputation of major institutions. However the supply of information is so patchy that users often accept whatever is offered.
- Standardization
- The Web offers the prospect of linking together all information resources on any given topic. However this is only possible if the form and content of information from different sources conform to standards. Perhaps the most urgent need is for documentation standards for recording sources.
- Scalability
- One of the more frustrating aspects of the Web's rapid growth has been that network traffic has increased in direct proportion. Much of this traffic is repetitious and unnecessary. For example 10,000 Australians pulling down the same file from Europe is a waste of international bandwidth. Mirrors, proxies and caches provide the obvious solution, but need much more coordination to be really effective.
Special interest networks
A Special Interest Network (SIN) is a group of sites on the Web that collaborate to provide comprehensive information about a particular subject. Note that SINs are organizations for coordinating the development of information. SINs should not be confused with the physical networks that connect computers together. Just as computer networks link together computers, so SINs link together information, people and activity on particular topics.
The main functions of a SIN fall under the following four headings:
- Publication - the SIN publishes information on the specialist topic. Besides articles and books in the traditional sense, publications can also include datasets, images, audio, and software. SINs adopt the fundamental principle that the supplier of a piece of information should also be its publisher. That is, rather than take (say) data from many different sites and place it all on a single server, each site runs its own server and publishes its own data. The logical endpoint of this trend would be a server on EVERY computer, with every individual user being his/her own publisher.
- Virtual Library - the SIN provides users with access to information on the specialist topic. Besides information stored on-site, there are links to relevant information elsewhere.
- On-line Services - a SIN can provide relevant services, such as analyzing data, to its users. On-line services include virtually all commercial activity.
- Communications - a SIN provides a means for people in the field to keep in touch. Communication includes many existing Internet activities, such as mailing lists, Usenet newsgroups, newsletters, and relay chat conferences.
A SIN consists of a series of participating "nodes" that each contribute to the network's functions. More specifically, each node carries out one or more of the following:
- Accept and store relevant, contributed material;
- Provide some form of public access for users;
- Provide some unique information, or mirror other sites;
- Provide organized links to other nodes;
- Coordinate their activity with other nodes.
SINs are the network equivalent of professional societies. Some may even be the communications medium for such societies (e.g. [Burdet92]). We can also consider SINs as a logical extension of newsgroups and bulletin boards. In short they aim to provide a complete working environment for their members and users.
A good example of a SIN is the European Molecular Biology Network (EMBNet). EMBNet is a special interest network that serves the European molecular biology and biotechnology research community. It consists of nodes operated by biologically oriented centers in different European countries. It features a number of services and activities, especially genomic databases such as EMBL [Cameron88].
Why form a SIN
The following features characterize most large special interest networks. They also provide guidelines for setting one up.
- Need - The SIN serves a need that is not being met by other means, or provides a better (more comprehensive, accurate or reliable) set of data than is available from other sources.
- Coordination - a coordinating centre or syndicate organizes the network, receives and processes new entries, and communicates relevant news to its users.
- Support - There is a body of users who are willing and able to help to establish and manage the network's information activities (managing databases, editing publications, moderating newsgroups, mailing lists, etc.).
- Participation - Anyone may contribute items to the information base. Major SINs announce new entries via special newsgroups or mailing lists. Contributors carry out all editing of their entries, including formatting, correcting and updating them.
- Access - Anyone may access, copy or use the information at any time. Normally access is via a computing network using a standard protocol.
- Standards - Coordinating and exchanging information are possible only if different data sets are compatible with one another. To be reusable, data must conform to standards (e.g. [Croft89]). The need for widely recognized data standards and data formats is therefore growing rapidly. Given the increasing importance of network communications new standards should be compatible with network protocols.
- Quality control - Users need some guarantee that data provided in a database are both valid and accurate [Green94], [GC94]. Quality control checks can be applied by database contributors, coordinators, and users (see later).
- Attribution - Every item of information should include an indication of its contributor. This is essential to the notion that contributions are a form of publication.
- Agreements - There is an explicit list of terms and conditions. Typically, users agree to acknowledge the sources and to waive liability for any use they make of the data. The organizers agree to abide by the usual conditions for publications, such as referring corrections or changes to the contributors.
- Automation - as many operations as possible (e.g. logging and acknowledging submissions) should be automated. For example, a universal model (Fig. 1) applies to the publishing process, whatever the material may be. Most of the steps can be automated.
Fig. 1. Stages in the publication of information on a node of a SIN. As many steps as possible should be automated.
Coordination
An information system that is distributed over several sites (nodes) requires close coordination between the sites involved. The coordinators need to agree on the following points:
- logical structure of the on-line information;
- separation of function between the sites involved;
- attribute standards for submissions (see below);
- protocols for submission of entries, corrections, etc.;
- quality control criteria and procedures (see below);
- protocol for on-line searching of the databases;
- protocols for "mirroring" the data sets.
For instance, an international database project might consist of agreements on the above points by a set of participating sites ("nodes"). Contributors could submit their entries to any node and each node would either "mirror" the others or else provide on-line links to them.
The information cycle
The use of information often falls into the following four-stage cycle of activities:
- asking questions,
- gathering relevant information,
- interpreting the information,
- disseminating the results.
SINs can assist at each stage of this cycle:
- In the first stage, communication enables people concerned with a particular topic to stay in constant touch with the relevant user community. The benefits include the ability to relay questions and initiate discussion of issues essentially in real time; to enable those who need to ask questions to contact people able to answer those questions; to provide a forum where current issues can be discussed in a timely fashion; and to minimize unnecessary duplication of effort.
- In the information gathering stage, not only can users more effectively reach sources of relevant information, but they can also help each other by indexing any new resources that they may discover in the process or by adding fresh data items to existing repositories.
- In the interpretation stage, users may be able to access useful software, search bibliographies, or seek advice from colleagues.
- In the dissemination phase, users will be able to publish their results to a very wide audience very quickly. In scientific research these practices are already widespread in many fields (e.g. physics) and a growing number of on-line journals already exist on the Internet.
Organization
SINs can (and no doubt will) be organized in many different ways. However the scheme outlined below (using the example of running a public database) recommends mechanisms that are designed to distribute the workload, encourage participation and accommodate growth:
- One node acts as a secretariat for the network.
- Each node serves some special function, such as acting as coordinating centre for one or more SIN projects, or acting as a regional centre.
- Each node mirrors a set of basic documents and/or menus that define the basic services offered by the SIN.
- Maintenance of each project and/or document is supervised by a coordinating centre (not necessarily the same for every activity).
- Material for publication may be submitted to any node (or perhaps to some subset).
- The coordinating centre for a given project regularly harvests incoming items from other nodes, carries out quality control procedures, and prepares updates.
- Each node carries out a mirroring operation regularly to retrieve up-to-date, local copies of new information from coordinating centres.
Many of the above steps will be automated. Whereas it is generally better to provide a pointer to the site that maintains an item of information, it is desirable to mirror any information (e.g. a "home" page for the SIN) that is frequently used, especially to reduce international traffic. Mirroring is also desirable in case of disk crashes or breaks in network connections.
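The regular mirroring operation can be sketched in a few lines. This is only an illustration under stated assumptions: the `Node` class, the integer version numbers and the in-memory document store are hypothetical stand-ins for whatever files and transfer protocol a real node would use.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A SIN node holding local copies of documents, keyed by name."""
    name: str
    # Maps document name -> (version number, content). Both are illustrative.
    documents: dict = field(default_factory=dict)


def mirror(local: Node, centre: Node) -> list:
    """Copy any document that is missing or older locally; return the names updated."""
    updated = []
    for doc_name, (version, content) in centre.documents.items():
        local_version = local.documents.get(doc_name, (-1, None))[0]
        if version > local_version:
            local.documents[doc_name] = (version, content)
            updated.append(doc_name)
    return sorted(updated)


# Hypothetical example: the regional node holds a stale copy of the home page.
centre = Node("centre", {"home": (3, "SIN home page v3"), "index": (1, "Index v1")})
regional = Node("regional", {"home": (2, "SIN home page v2"), "index": (1, "Index v1")})
changed = mirror(regional, centre)
```

Running the operation a second time copies nothing, which is what makes it safe to schedule at regular intervals.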
Quality control
Users need assurance that data is correct, that software works, and that articles contain valid information. Because anyone can open a network site and release anything they like, quality is not assured. Users therefore tend to refer to sites that act as an authoritative source or some other guarantee of quality. For this reason users usually prefer sites that are well-managed, well-organized, or belong to respected institutions.
To ensure validity, molecular biology databases use the simple, but effective criterion of publication in a refereed journal. Many other approaches can be used. For example one might insist that a description of methodology accompany each data set that has not been published (say) in the scientific literature. Alternatively, a site might accept all contributions and categorize them on the basis of the evident quality of information.
Whatever criteria are used it is desirable to include indicators of reliability for the information in the attribute standard. Ideally every item of information should include a tag denoting accuracy or validity. Quality control fields need to include information about what error checks have been applied to ensure that the values have been recorded and entered correctly.
The compiling agent can apply consistency and outlier checks to filter out errors that may have been missed earlier [Green94]. If the data incorporate sufficient redundancy, then consistency checks can reveal many errors. Does the location entered for a town lie on dry land, for instance? Outlier tests reveal unusual records that need to be checked. Both sorts of checks can be automated and are now routine for census data. They have recently been applied to herbarium records and other environmental data [Green94], [GC94].
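Both kinds of check are easy to automate. The sketch below is an assumption-laden illustration, not the procedure used in the cited work: the consistency check tests coordinates against a rough, hypothetical bounding box for a stated region (a crude stand-in for a "dry land" test), and the outlier check uses the modified z-score based on the median and the median absolute deviation, chosen here simply because it is a common robust test.

```python
from statistics import median

# Hypothetical, approximate bounding boxes: (min_lat, max_lat, min_lon, max_lon).
REGION_BOUNDS = {"NSW": (-37.6, -28.1, 140.9, 153.7)}


def consistency_errors(records):
    """Return ids of records whose coordinates fall outside their stated region."""
    bad = []
    for rec in records:
        min_lat, max_lat, min_lon, max_lon = REGION_BOUNDS[rec["region"]]
        if not (min_lat <= rec["lat"] <= max_lat and min_lon <= rec["lon"] <= max_lon):
            bad.append(rec["id"])
    return bad


def outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be flagged
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical records: the second has a sign error in its latitude.
records = [
    {"id": "albury", "region": "NSW", "lat": -36.08, "lon": 146.92},
    {"id": "typo", "region": "NSW", "lat": 36.08, "lon": 146.92},
]
flagged_records = consistency_errors(records)
flagged_values = outliers([12.1, 11.8, 12.4, 12.0, 120.0])
```

Flagged records are candidates for checking, not automatic rejections; an unusual value may turn out to be correct.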
The general publication procedure (Fig. 1) includes a quality control step. When a contribution is received the editor applies tests to ensure that the information conforms to the standard and to check for any obvious errors. For text material this quality control process might simply be a careful reading of manuscripts. If any faults are detected, the information is returned to the source for correction. After this initial checking, new items are placed in an updates area (Fig. 1) and users are invited to submit comments about them. After suitable checks, and corrections by the contributor, the new entry is transferred to the database proper.
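The procedure just described can be sketched as a small decision function. The stage names, the required fields and the `comments_resolved` flag are hypothetical; a real node would substitute its own attribute standard and review process.

```python
from enum import Enum


class Stage(Enum):
    RETURNED = "returned to contributor"
    UPDATES = "in updates area"
    PUBLISHED = "in database"


# Hypothetical attribute standard: fields every submission must fill in.
REQUIRED_FIELDS = ("title", "contributor", "body")


def initial_check(item):
    """Editor's check: list the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not item.get(f)]


def process(item, comments_resolved):
    """Decide where a submission belongs, following the steps described above."""
    faults = initial_check(item)
    if faults:
        return Stage.RETURNED, faults  # send back to the source for correction
    if not comments_resolved:
        return Stage.UPDATES, []  # visible to users, awaiting comments
    return Stage.PUBLISHED, []  # transferred to the database proper


complete = {"title": "Fire history", "contributor": "A. Author", "body": "..."}
incomplete = {"title": "Fire history"}
```

For example, `process(incomplete, True)` returns the item to its contributor with the list of missing fields, while `process(complete, False)` holds it in the updates area.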
Distributed databases
An important activity of a SIN is for many sites to contribute to build a joint database that is searchable across the network. This is also an important way in which different SINs can cooperate with each other. For instance a SIN on (say) environment may link to another SIN dealing with (say) plant taxonomy or climate for relevant portions of its virtual library.
A network database can have four different levels of distribution:
- Centralized - the entire database resides on a single server; other sites point to it. This is the most common form of network database.
- Distributed data, separate indices at each site - The database consists of several component databases, each maintained at different sites. A common interface (e.g. a WWW document) provides pointers to the components, which are queried separately. This form of loose integration is common using Gopher, WAIS and WWW.
- Distributed data, single centralized index - The data consists of many items, which are stored at different sites but accessed via a database of pointers maintained at a single site. Several forms of network indexing, such as Harvest, support this form of integration.
- Distributed data, multiple queries - many component databases are queried simultaneously across the network from a single interface. At present no common protocol publicly available supports such a flexible form of database integration, but it is possible to use proprietary software from a single supplier.
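The fourth level, in which one interface queries many component databases at once, can be sketched as follows. This is a toy illustration only: the site names and records are invented, and `query_site` stands in for whatever remote query mechanism the participating sites would actually agree on.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical component databases, one per site: site name -> list of records.
SITES = {
    "site_a": ["Eucalyptus regnans", "Acacia dealbata"],
    "site_b": ["Eucalyptus globulus"],
    "site_c": ["Banksia serrata"],
}


def query_site(site, term):
    """Stand-in for a remote query: case-insensitive substring match at one site."""
    return [(site, rec) for rec in SITES[site] if term.lower() in rec.lower()]


def query_all(term):
    """Send the same query to every site concurrently and merge the answers."""
    with ThreadPoolExecutor() as pool:
        per_site = list(pool.map(lambda site: query_site(site, term), SITES))
    return sorted(hit for site_hits in per_site for hit in site_hits)


hits = query_all("eucalyptus")
```

Each hit records the site it came from, so the merged result still attributes every item to the node that maintains it.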
Network library
An important function of a special interest network is to provide a virtual library. That is, it should provide organized links to relevant information, wherever this information resides on the Internet. The biggest and best known virtual library is the World Wide Web Virtual Library.
The logical design of the system could be based around major projects & themes and the library can be compiled and maintained in several ways:
- Members can submit "hotlists" of thematic pointers to a coordinating centre for editing;
- An automatic registration service (e.g. via email or as a WWW form) can be available for people to submit relevant link information, which is then processed by scripts on a network server.
The above information could be made available via a series of menus and pages available on the Internet via Gopher, World Wide Web and other suitable protocols. Copies of the main pages and hierarchy of documents could be available at each node in the network.
This will require a regular "mirroring" process to ensure that all nodes are kept up to date. It is very important to ensure that all information items in this library are visible at all nodes and not just visible as an isolated reference at a particular site.
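A registration script of the kind mentioned above might look like the following sketch. The accepted URL schemes, the theme names and the dictionary used as the library are assumptions made for illustration; only the standard `urllib.parse.urlparse` call is taken from Python itself.

```python
from urllib.parse import urlparse

ACCEPTED_SCHEMES = ("http", "gopher", "ftp")  # hypothetical policy for this sketch


def register_link(library, theme, title, url):
    """Add a submitted link under a theme; return True if it was accepted."""
    parts = urlparse(url)
    if parts.scheme not in ACCEPTED_SCHEMES or not parts.netloc:
        return False  # reject malformed or relative URLs
    entries = library.setdefault(theme, [])
    if any(existing_url == url for _, existing_url in entries):
        return False  # already registered under this theme
    entries.append((title, url))
    entries.sort()  # keep each theme's links in a stable, alphabetical order
    return True


library = {}
first = register_link(library, "biodiversity", "BIN21", "http://life.csu.edu.au/bin21/")
repeat = register_link(library, "biodiversity", "BIN21", "http://life.csu.edu.au/bin21/")
relative = register_link(library, "fire", "FireNet", "/firenet/")
```

Rejecting relative addresses matters here because a link that only resolves at one site would not be visible from the mirrored copies at other nodes.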
Network publishing
Network publications can range from familiar paper items - books, journals, news magazines - that are simply transferred to electronic form to novel productions, such as image databases or thematic compilations of pointers to items stored at many different sites.
An important principle in network publication is that the site that maintains an item of information publishes the information. This rule applies especially to items that are updated regularly. Secondary sources (other sites that want to provide their users with access to the item concerned) should adopt one of two options: either provide a link to the primary site, or else mirror the original by downloading copies at regular intervals. These practices ensure that users always have access to the most up-to-date information available.
One approach to publishing that a SIN can adopt is simply to register relevant existing activities. This benefits both the SIN as a whole and the publishing site:
- individual sites can gain an international "stamp of approval", and world-wide collaboration, for particular projects by having them recognized by the SIN;
- a SIN can incorporate many different projects, each supervised by a separate node, and no single agency needs to bear the full burden for any particular project;
- a SIN or site can continue to focus on its own particular area of specialization or expertise and still provide access to information held at other sites.
Automation
Automation is a key element in making a SIN viable. The aim is to reduce the workload and human involvement in creating and maintaining information, and hence costs, for participating nodes. For example, publishing submitted material (whether text, data, images etc) involves several steps (Fig. 1). As many as possible of these steps should be automated. For instance, storing, registering and acknowledging incoming material are routine procedures that are time-consuming if done "by hand".
Once the necessary scripts and programs have been developed, they could be provided with other standard files as a startup package to new nodes. In many cases the scripts and programs needed to automate particular procedures already exist and are freely available on the Internet.
Commercial activity
The SIN model accommodates commercial activity in a very natural way. It is in any company's interests to align itself with networks whose topic areas relate to the company's business. A pharmaceutical company, for instance, would have natural links with networks dealing with health, medicine or similar topics. This again is the honeypot effect in action: a SIN has the desirable property of channelling interested users very efficiently to groups who offer the most relevant information or services.
Companies can participate in a SIN in at least two ways. First they can provide information or services that directly "add value" to the SIN. Alternatively they could simply use the SIN as a honeypot and provide a link to their commercial information as an on-line commercial service.
This latter approach provides an important potential mechanism for funding SINs. Just as companies pay for advertising on TV or public transport, so they could pay to have links as "on-line services" in relevant SINs. This approach would be ideal for niche marketing, especially for highly specialized areas.
Demonstrations
The following organizations have explicitly adopted the formal model outlined above. Most of them are still under development. The "SIN of SINs" is a development body that aims to develop the SIN model. The "Australia Index" is a joint project that aims to index all Australian information, and all information about Australia. The other examples listed here concern environmental information.
- The Australia Index
- /firenet/australia/
- The SIN of SINs
- http://life.csu.edu.au/sin/
- Australian Environmental Network
- http://life.csu.edu.au/aenet/
- Biodiversity Information Network (BIN21)
- http://life.csu.edu.au/bin21/
- FireNet
- /firenet/
- International Organization for Plant Information (IOPI)
- http://iopi.csu.edu.au/iopi/
The Australia Index, for example, is a collaborative project that aims to develop a comprehensive on-line index of Australian information, and information about Australia, and to make it available on the Internet. The Index will include comprehensive, searchable databases as well as structured guides and interpreted views ("virtual libraries") of the information aimed at particular needs, especially education, tourism, research, government and commerce.
Conclusion
The notion of a SIN as described here derives from three sources. First, as manager of a network information server I was prompted to develop the idea after observing the ways in which various sites had begun to coordinate their activities on particular topics. It seemed to me that SINs have the potential to fill both the role of learned societies as authoritative bodies, and of libraries as stable repositories of knowledge and information.
Second, the evident success of molecular biology databases and physics preprint services suggests that the underlying principles can be extended to many fields of information and activity.
Finally there is the problem of how to organize an exploding pool of information on the network. Librarians have struggled with this problem for centuries. Whilst they have evolved many workable solutions, the explosion of information on the Web poses problems never encountered before: the sheer volume of information, rapid turnover and change (especially the need to maintain information), and the flexibility of hypertext and multimedia. But most of all the Web blurs the distinction between information suppliers, distributors and users. The SIN model provides a user-driven solution, in which groups of people interested in a particular topic organize and index information in ways that they find most useful. The Twenty-First Century will surely become the era of the knowledge web. SINs, in whatever form they may take, will play a major role in its organization.
References
- [Bilofsky88]
- Bilofsky, H. S. & Burks, C. (1988). The GenBank genetic sequence data bank. Nucl. Acids Res. 16: 1861-1863.
- [Burdet92]
- Burdet, H. M. (1992). What is IOPI? Taxon 41: 390-392.
- [Cameron88]
- Cameron, G. N. (1988). The EMBL data library. Nucl. Acids Res. 16: 1865-1867.
- [Canhos92]
- Canhos, V., Lange, D., Kirsop, B.E., Nandi, S., Ross, E. (Eds). (1992). Needs and Specifications for a Biodiversity Information Network. United Nations Environment Programme, Nairobi.
- [Croft89]
- Croft, J.R. (1989). Herbarium information standards and protocols for interchange of data. Australian National Botanic Gardens, Canberra.
- [Goldfarb90]
- Goldfarb, C. (1990). The SGML Handbook. Oxford: Oxford University Press.
- [Green94]
- Green, D.G. (1994). Databasing diversity - a distributed, public-domain approach. Taxon 43: 51-62.
- [GC94]
- Green, D.G. & Croft, J.R. (1994). Proposal for Implementing a Biodiversity Information Network. In Linking Mechanisms for Biodiversity Information. Proceedings of a Workshop for the Biodiversity Information Network, Base de Dados Tropical, Campinas, Sao Paulo, Brasil.
- [GGT93]
- Green, D.G., Gill, A.M. & Trevitt, A.C.F. (1993). FIRENET - an international network for landscape fire information. Wildfire - Quarterly Bulletin of the International Association of Wildland Fire 2(4): 22-30.
- [Krol92]
- Krol, E. (1992). The Whole Internet Guide and Catalog. O'Reilly and Associates.
- [SS88]
- Smith, J. & Stutely, R. (1988). SGML: the Users' Guide to ISO 8879. New York/Chichester/Brisbane/Toronto: Ellis Horwood Limited/Halstead Press.
COPYRIGHT © 1995 by AUUG95 and APWWW95 Charles Sturt University. ALL RIGHTS RESERVED.