David Hatherly
School of Information Technology,
Charles Sturt University
Email: dhatherly@csu.edu.au
Terry Bossomaier
School of Information Technology,
Charles Sturt University
Email: tbossomaier@csu.edu.au
This paper addresses an alternative document filing system, with no ambiguity, which has the flexibility to be used in any type of organisation and can be used across the Internet to provide an integrated document tracking system for distributed organisations.
The paper will explore the use of SGML (Standard Generalised Markup Language) for the specification and storage of meta data for documents; the world wide web for distributed user interaction; and CGI (Common Gateway Interface) software for the management of document archives.
One of the problems with traditional filing systems is the subjective choice of categories within which documents are classified when they are being stored. Often, perhaps usually, a single person examines a given document and chooses the best match for it from a list of previously defined classifications. It is also common that the person who chooses the classification does so based on their own undocumented opinions about the best or most likely category for the document. In some cases, one clerical officer who does the storing may not have the same interpretation of the nature and content of the document as another person who needs to retrieve it. While it is true that in many cases content specialists advise clerical workers on the 'best' category, this level of formality and consistency may not be present in all organisations nor across agencies of distributed organisations. The situation may be exacerbated by changes in staff or changes in levels of responsibility of staff.
Distributed organisations suffer further when the same documents are received by more than one agency. In such situations, not only are the documents duplicated, but the classification used is likely to be different for different agencies.
As a simple example, a printed report relating to proposed funding of a new advertising campaign may be deemed by one person to be best filed under 'reports'; by another under 'funding'; by another under 'advertising'; and so on. The result of such ad hoc schemes is ambiguity.
Although formal classification schemes used by document specialists (Dewey system in libraries, etc.) are generally successful in removing ambiguity, they are typically not used in autonomous commercial organisations due to their complexity and the overhead of staff training.
Such systems provide immediate improvements in efficiency over manual systems but the standard DBMS approach remains compromised in a number of significant ways:
By creating a document tracking system which uses SGML, we can retain the major benefit of computerised filing systems (lack of ambiguity of classification) and remove some of the deficiencies of the traditional and DBMS approaches. The result is a system which allows user-defined information about any document to be stored intact, then indexed, searched and retrieved with software tools from any vendor. The document records are also readable by humans. There are software tools (parsers) which can rigorously check our structure definitions and our records (instances in SGML) to make sure they adhere to the defined standard.
An SGML-based document tracking system allows an organisation to define a suitable set of meta data (information about the documents) for documents they wish to track, and to specify the structure of the meta data in a rigorous way. This is done with the SGML Document Type Definition (DTD). A detailed explanation of DTDs is beyond the scope of this paper, but a simple example is given here:
If the meta data we wish to store for tracking local documents consists of just author, title and date, we can specify the order and occurrence of these in a DTD as:
<!DOCTYPE LOCALDOC [
<!ELEMENT LOCALDOC - - (AUTHOR+, TITLE, DATE?)>
<!ELEMENT (AUTHOR, TITLE, DATE) - - (#PCDATA)>
]>
The above example, although simple, shows that a LOCALDOC instance consists of at least one AUTHOR, exactly one compulsory TITLE and at most one (optional) DATE. Furthermore, these elements must appear in the order specified in the DTD. The elements themselves, in this case, simply contain parsable character data; see 'A Gentle Introduction to SGML' for an introduction to SGML.
A sample instance of a LOCALDOC is shown here:
<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE>July 20, 2003</DATE>
</LOCALDOC>
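As a rough illustration of what a parser checks, the content model (AUTHOR+, TITLE, DATE?) can be expressed as a regular expression over the sequence of child element names. The sketch below is not an SGML parser: it relies on the sample instance also being well-formed XML (no tag minimisation), and the function name `conforms_to_localdoc` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model (AUTHOR+, TITLE, DATE?) written as a regular expression
# over the space-terminated sequence of child element names.
CONTENT_MODEL = re.compile(r"(AUTHOR )+TITLE (DATE )?")


def conforms_to_localdoc(instance_text):
    """Return True if the instance matches the LOCALDOC content model.

    Assumption: the instance is also well-formed XML, so the standard
    library XML parser can stand in for a real SGML parser.
    """
    try:
        root = ET.fromstring(instance_text)
    except ET.ParseError:
        return False
    if root.tag != "LOCALDOC":
        return False
    # AUTHOR, TITLE and DATE hold character data only, so no nesting.
    if any(len(child) for child in root):
        return False
    sequence = "".join(child.tag + " " for child in root)
    return CONTENT_MODEL.fullmatch(sequence) is not None


sample = """<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE>July 20, 2003</DATE>
</LOCALDOC>"""

print(conforms_to_localdoc(sample))
```

An instance with TITLE before AUTHOR, or with two DATE elements, is rejected by the same check.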
The instance is in plain text, and the straightforward markup allows manipulation by commercial or custom software tools for the purposes of searching or rendering to a variety of output devices. The style of markup is, of course, easily recognised by anyone who is familiar with the World Wide Web. Web documents are instances of another SGML DTD - HTML.
Elements can also have attributes. These enable the markup to contain information which may or may not be part of the actual data but which can add value to the semantics of the instance. As a final touch to the example, the LOCALDOC document type definition could be modified to include an attribute for element DATE, as follows:
<!ATTLIST DATE ASSIGNED ( WRITTEN | RECEIVED ) RECEIVED>
The ASSIGNED attribute here can be used to signal whether the date recorded in the instance is the date on which the document was written or the date on which it was received. An instance could thus be produced as follows:
<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE ASSIGNED="RECEIVED">July 20, 2003</DATE>
</LOCALDOC>
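The attribute declaration lists two permitted values and names RECEIVED as the default, so software reading an instance can treat a DATE with no ASSIGNED attribute as a received date. A minimal sketch of that lookup follows; it again assumes the instance is well-formed XML, and `assigned_value` is a hypothetical helper rather than part of DTS.

```python
import xml.etree.ElementTree as ET

# Permitted values and default, taken from the ATTLIST declaration.
ASSIGNED_VALUES = {"WRITTEN", "RECEIVED"}
ASSIGNED_DEFAULT = "RECEIVED"


def assigned_value(instance_text):
    """Return the effective ASSIGNED value of DATE, or None if DATE is absent."""
    root = ET.fromstring(instance_text)
    date = root.find("DATE")
    if date is None:
        return None
    # Fall back to the declared default when the attribute is omitted.
    value = date.get("ASSIGNED", ASSIGNED_DEFAULT)
    if value not in ASSIGNED_VALUES:
        raise ValueError(f"ASSIGNED must be one of {sorted(ASSIGNED_VALUES)}")
    return value


explicit = '<LOCALDOC><AUTHOR>A</AUTHOR><TITLE>T</TITLE><DATE ASSIGNED="WRITTEN">July 20, 2003</DATE></LOCALDOC>'
implicit = "<LOCALDOC><AUTHOR>A</AUTHOR><TITLE>T</TITLE><DATE>July 20, 2003</DATE></LOCALDOC>"

print(assigned_value(explicit))
print(assigned_value(implicit))
```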
By completing a Web form, any user can input meta data for a document at their local machine. This data is delivered to the Web server by the browser and can be collected by a program running on the server using the Common Gateway Interface (CGI). The instances produced can be stored on the Web server (or written to some other machine) where they become available for searching and retrieval.
The document tracking system addressed in this paper uses this technique. It combines into a working application the benefits of computerised filing, SGML specification for documents, the World Wide Web user interface, the CGI mechanism and some readily available indexing tools.
The document tracking system (DTS) input component is managed by a CGI program which can read and interpret an SGML DTD. Each organisation can create a DTD to define the information they wish to record about their documents and make this DTD available to the DTS input program in a simple configuration file. They then create a web input form to match the elements of the DTD, using straightforward HTML markup. Thus, customisation of the document tracking system is possible without modification to the CGI program itself. Similarly, the retrieval of documents can also be freely configured to match any customised DTD. The specifics of the configuration methods are covered in the DTS documentation.
Using a Web form for input provides some benefits for distributed organisations, but it also brings some limitations. The naming conventions in Web forms mean that the instance must be reasonably 'flat' in structure since there is a simple mapping between input field names and the elements/attributes. Methods to overcome this limitation are currently being developed and later versions of DTS will provide for more flexibility in the content models for any element in the DTD.
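The simple mapping between form field names and elements can be sketched as follows. This is an illustration only, not the DTS input program: the lower-case field names (`author`, `title`, `date`), the hard-coded element order and the function `build_instance` are all assumptions, whereas DTS reads the structure from the DTD.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

# Element order taken from the LOCALDOC content model; the matching
# lower-case form field names are an assumption for this sketch.
FIELD_ORDER = ["AUTHOR", "TITLE", "DATE"]


def build_instance(form_body):
    """Turn URL-encoded form data into a flat LOCALDOC instance."""
    fields = parse_qs(form_body)  # blank fields are dropped by default
    lines = ["<LOCALDOC>"]
    for name in FIELD_ORDER:
        # A repeated field (e.g. two authors) yields repeated elements.
        for value in fields.get(name.lower(), []):
            lines.append(f"<{name}>{escape(value)}</{name}>")
    lines.append("</LOCALDOC>")
    return "\n".join(lines)


body = (
    "author=Monty+Burns+-+Manager"
    "&title=Funding+of+Springfield+Advertising+Campaign"
    "&date=July+20%2C+2003"
)
print(build_instance(body))
```

Because each field maps to exactly one element at the top level, nested content models cannot be expressed this way, which is the 'flat' limitation noted above.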
On an ordinary (initial) call to DTSLIST, the user is presented with a sequential list of stored documents showing the sequence number, summary information and date and time stamp. This list is rendered with each document sequence number as a hypertext link.
Linking through any sequence number calls DTSLIST with the selected sequence number as an argument and produces a display of the required document rendered as a simple textual outline.
The use of SGML as described above allows descriptors for a document to be identified: to complete the package, a means of searching those descriptors and retrieving documents is needed.
Just as the Web provides a means for document entry, there is a range of Web tools available for indexing and retrieval; Webster and Paul provide an extensive 'Webliography' containing information and links to many of these.
The following criteria were deemed to be important in choosing between them:
Harvest, produced by the University of Colorado, was chosen as it met the following criteria:
Harvest can be viewed as comprising four related subsystems. They are:
attribute_name{length}: attribute_value
Thus, for the simple SGML document descriptor given above, and assuming that each SGML tag becomes a SOIF attribute of the same name, the corresponding SOIF object is:
author{21}: Monty Burns - Manager
title{43}: Funding of Springfield Advertising Campaign
date{13}: July 20, 2003
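The conversion from a descriptor to these attribute lines can be sketched in a few lines. The sketch assumes the length in braces counts the bytes of the value (identical to the character count for the ASCII example above) and that the instance is well-formed XML; `soif_line` and `instance_to_soif` are hypothetical names, and only the attribute lines are produced, not a complete SOIF object.

```python
import xml.etree.ElementTree as ET


def soif_line(name, value):
    """Format one attribute as name{length}: value."""
    # Assumption: the length is the byte count of the value.
    length = len(value.encode("utf-8"))
    return f"{name}{{{length}}}: {value}"


def instance_to_soif(instance_text):
    """Map each child element to an attribute line of the same (lower-case) name."""
    root = ET.fromstring(instance_text)
    return [
        soif_line(child.tag.lower(), (child.text or "").strip())
        for child in root
    ]


descriptor = """<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE>July 20, 2003</DATE>
</LOCALDOC>"""

for line in instance_to_soif(descriptor):
    print(line)
```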
The DTS system is currently being extended in a number of areas. One of these extensions is being trialled across three of the campuses of Charles Sturt University (Albury, Wagga Wagga and Bathurst), which act as distributed agencies with common document tracking.
Another extension concerns SGML software tools. A structured SGML outliner/editor package is currently being developed by one of the authors (Errol Chopping) and as this comes to a working stage, a number of its features will be employed in the DTS package.
Other extensions involve the user interface subsystem. The current document retrieval form is generic, having been designed for unstructured documents. It is proposed to modify the form and associated cgi scripts to reflect the known structure of our standard DTD (for example, allowing users to select element names from a 'drop down' list).
Further extensions will allow the system to incorporate new document types. Currently the user must carry out a number of manual tasks to incorporate a new document type into the system: the gatherer must be told where to find files of that type, and the summariser must be given the DTD and told how to identify documents of that type and which elements to extract from them. An appropriate retrieval form and associated CGI script must then be developed. It is hoped that many of these tasks can be automated (perhaps dynamically from the supplied DTD). This final stage awaits the widespread adoption of a facility such as Java to allow forms to be modified as data is entered into them.
This subsystem is highly customisable. The user is able to specify the access methods used to retrieve files (for example, FTP, Gopher, HTTP, etc.) and the locations which will be searched (either as directory names or as URLs). The user is also able to specify which file types are to be included in the search, by reference either to the file's name or to its contents.
In the case of the Document Tracking system, we used the file name as our determinant, specifying that only files having a suffix of '.dts' were to be considered.
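The file-name test amounts to a suffix filter. The fragment below illustrates the rule only; it is not Harvest configuration syntax, and `select_tracked` is a hypothetical name.

```python
from pathlib import PurePosixPath

TRACKED_SUFFIX = ".dts"


def select_tracked(paths):
    """Keep only the paths whose file name ends in the tracked suffix."""
    return [p for p in paths if PurePosixPath(p).suffix == TRACKED_SUFFIX]


found = ["docs/0001.dts", "docs/readme.txt", "docs/0002.dts", "docs/0003.dts.bak"]
print(select_tracked(found))
```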
This subsystem is also configurable but, for our application, the aspect of particular interest is the handling of SGML files. The user must supply three things: information that allows the file type to be determined (in our case the suffix '.dts', as described above); the DTD; and a mapping table that specifies which of the SGML tags in the files are to be extracted to SOIF and the attribute name by which each is to be known. Files are parsed according to their DTD, with the required SGML tags being mapped to SOIF attributes.
In the case of the Document Tracking System, we have chosen to map each SGML tag to a SOIF attribute of the same name. In addition to the user-specified attributes, the summariser adds a range of other attributes such as file name, file size and type.
The standard Harvest distribution uses the Glimpse indexing engine, although others, such as WAIS, can be used if required, and it is even possible to substitute a relational database. We opted to use Glimpse since it was free, it was more flexible than WAIS (allowing matching on or off word boundaries and tolerating misspellings) and, since it was part of the standard distribution, it required less effort on the part of implementors.
Glimpse is also configurable (with the major considerations being the size of the index being created, and the percentage of "common" words to be left out of the index), although in our system we have not needed to change the default values.
To overcome this difficulty, the Harvest distribution includes a Web form and associated cgi scripts. These allow the user to enter the search criteria (either as an attribute name and text to be sought in that attribute - for example title : springfield - or as text to be sought anywhere in the stored document), which is passed to the script for validity checking. Valid requests are passed to the indexing subsystem which returns data from the SOIF object matching the request. The script then formats this data and displays it to the user, including URL links to each SOIF object.
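The two forms of search criteria can be illustrated with a small sketch. It does not reproduce the Harvest scripts or the Glimpse query syntax: the parsing rule, the case-insensitive substring match and the names `parse_query` and `matches` are all assumptions made for the example.

```python
def parse_query(text):
    """Split a query into (attribute, term); attribute is None for free text."""
    text = text.strip()
    if not text:
        raise ValueError("empty query")
    name, sep, term = text.partition(":")
    if not sep:
        # No colon: search every attribute of the stored record.
        return None, text.lower()
    name, term = name.strip(), term.strip()
    if not name or not term:
        raise ValueError("expected 'attribute : text'")
    return name.lower(), term.lower()


def matches(record, query):
    """Return True if the record (attribute name -> value) satisfies the query."""
    attribute, term = parse_query(query)
    if attribute is None:
        return any(term in value.lower() for value in record.values())
    return term in record.get(attribute, "").lower()


record = {
    "author": "Monty Burns - Manager",
    "title": "Funding of Springfield Advertising Campaign",
    "date": "July 20, 2003",
}

print(matches(record, "title : springfield"))
print(matches(record, "author : springfield"))
print(matches(record, "burns"))
```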