A Scalable, Extensible Document Tracking System

Errol Chopping
School of Information Technology,
Charles Sturt University

Email: echopping@csu.edu.au

David Hatherly
School of Information Technology,
Charles Sturt University

Email: dhatherly@csu.edu.au

Terry Bossomaier
School of Information Technology,
Charles Sturt University

Email: tbossomaier@csu.edu.au

Abstract

Institutions receive and generate a variety of different documents which are traditionally filed according to their subject matter. Unfortunately, such classification schemes are problematic and often lead to ambiguity in document storage and retrieval.

This paper addresses an alternative document filing system, with no ambiguity, which has the flexibility to be used in any type of organisation and can be used across the Internet to provide an integrated document tracking system for distributed organisations.

The paper will explore the use of SGML (Standard Generalised Markup Language) for the specification and storage of meta data for documents; the world wide web for distributed user interaction; and CGI (Common Gateway Interface) software for the management of document archives.

1 Traditional document management systems

Organisations depend on documents for general management and communication and have traditionally used standard manual filing systems for storage. In large organisations, significant quantities of paper-based and electronic documents need to be filed for later retrieval and reference. The importance of accurate document storage and retrieval is obvious. For organisations which have branches or agencies in different locations, the need for some common filing system is imperative.

One of the problems with traditional filing systems is the subjective choice of categories within which documents are classified when they are being stored. Often, perhaps usually, a single person examines a given document and chooses the best match for it from a list of previously defined classifications. It is also common that the person who chooses the classification does so based on their own undocumented opinions about the best or most likely category for the document. In some cases, one clerical officer who does the storing may not have the same interpretation of the nature and content of the document as another person who needs to retrieve it. While it is true that in many cases content specialists advise clerical workers on the 'best' category, this level of formality and consistency may not be present in all organisations nor across agencies of distributed organisations. The situation may be exacerbated by changes in staff or changes in levels of responsibility of staff.

Distributed organisations suffer further when the same documents are received by more than one agency. In such situations, not only are the documents duplicated, but the classification used is likely to be different for different agencies.

As a simple example, a printed report relating to proposed funding of a new advertising campaign may be deemed by one person to be best filed under 'reports'; by another under 'funding'; by another under 'advertising'; and so on. The result of such ad hoc schemes is ambiguity.

Although formal classification schemes used by document specialists (Dewey system in libraries, etc.) are generally successful in removing ambiguity, they are typically not used in autonomous commercial organisations due to their complexity and the overhead of staff training.

2 Computerised document tracking

2.1 The database approach

Computerised filing systems have a number of advantages over manual systems. Typically a user enters information about a given document into a database management system (DBMS). As part of the data entry process, the DBMS will assign a unique identifier to each document record. This identifier may be used as an indexed key for retrieval purposes and can also be used to annotate actual documents when they are being stored. When document retrieval is required, a user may enter an identifier, if it is known. More commonly, a user will enter criteria which the DBMS uses to retrieve a bounded subset of records, from which the user selects the record(s) required.
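As a minimal sketch of the database approach just described, the following Python fragment uses an in-memory SQLite database. The `documents` table, its columns and the function names are assumptions made for illustration; they are not part of any system discussed in this paper. The DBMS assigns the identifier on insertion, and a criteria search returns a bounded subset of records.

```python
import sqlite3

def open_store():
    """Create an in-memory store; the table layout is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " author TEXT NOT NULL,"
        " title TEXT NOT NULL)"
    )
    return conn

def add_document(conn, author, title):
    """Insert a record and return the identifier assigned by the DBMS."""
    cur = conn.execute(
        "INSERT INTO documents (author, title) VALUES (?, ?)", (author, title)
    )
    conn.commit()
    return cur.lastrowid

def find_by_title(conn, fragment):
    """Return (id, title) pairs whose title contains the fragment."""
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE title LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return rows.fetchall()
```

The identifier returned by `add_document` is the value that would be written on the physical document when it is filed.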

Such systems provide immediate improvements in efficiency over manual systems but the standard DBMS approach remains compromised in a number of significant ways:

2.2 The SGML approach to document tracking

Standard Generalised Markup Language provides a definitive method of specifying the structure of a class of documents. It is emerging as the standard and the most promising method of describing and storing documents. A small but useful reference for SGML can be found in Jeffery Spivak's book 'The SGML Primer' (Spivak 1996) and there is a comprehensive coverage in the course notes available from (Exoterica 1996). For a good coverage of text (and hypertext) standards in general, see DeRose 1995.

By creating a document tracking system which uses SGML, we can retain the major benefit of computerised filing systems (lack of ambiguity of classification) and remove some of the deficiencies of the traditional and DBMS approaches. The result is a system which allows user-defined information about any document to be stored intact, then indexed, searched and retrieved with software tools from any vendor. The document records are also sensible to human readers. There are software tools (parsers) which can rigorously check our structure definitions and our records (instances in SGML) to make sure they adhere to the defined standard.

An SGML-based document tracking system allows an organisation to define a suitable set of meta data (information about the documents) for documents they wish to track, and to specify the structure of the meta data in a rigorous way. This is done with the SGML Document Type Definition (DTD). A detailed explanation of DTDs is beyond the scope of this paper, but a simple example is given here:

2.2.1 SGML implementation example

If the meta data we wish to store for tracking local documents consists of just author, title and date, we can specify the order and occurrence of these in a DTD as:

<!DOCTYPE LOCALDOC [
<!ELEMENT LOCALDOC - - (AUTHOR+, TITLE, DATE?)>
<!ELEMENT (AUTHOR, TITLE, DATE ) - - (#PCDATA)>
]>

The above example, although simple, shows that a LOCALDOC instance contains at least one AUTHOR, a single compulsory TITLE and an optional DATE which may occur at most once. Furthermore, these elements must appear in the order specified in the DTD. The elements themselves, in this case, simply contain parsable character data. For an introduction to SGML, see 'A Gentle Introduction to SGML' (Sperberg-McQueen & Burnard 1996).
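The content model (AUTHOR+, TITLE, DATE?) can be read as a pattern over the sequence of child element names. The Python sketch below only illustrates that ordering rule and is no substitute for an SGML parser; the function name and the regular-expression encoding are assumptions made here.

```python
import re

# Encoding of the content model (AUTHOR+, TITLE, DATE?) as a pattern over
# space-terminated element names. This encoding is an illustrative assumption.
LOCALDOC_MODEL = re.compile(r"(AUTHOR )+TITLE (DATE )?")

def matches_localdoc_model(child_names):
    """Return True if the child element names satisfy the content model."""
    sequence = "".join(name + " " for name in child_names)
    return LOCALDOC_MODEL.fullmatch(sequence) is not None
```

For example, the sequence AUTHOR, AUTHOR, TITLE is accepted, while TITLE, AUTHOR is rejected because the order is wrong.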

A sample instance of a LOCALDOC is shown here:

<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE>July 20, 2003</DATE>
</LOCALDOC>

The instance is in plain text, and the straightforward markup allows manipulation by commercial or custom software tools for the purposes of searching or rendering to a variety of output devices. The style of markup is, of course, easily recognised by anyone who is familiar with the World Wide Web. Web documents are instances of another SGML DTD - HTML.

Elements can also have attributes. These enable the markup to contain information which may or may not be part of the actual data but which can add value to the semantics of the instance. As a final touch to the example, the LOCALDOC document type definition could be modified to include an attribute for element DATE, as follows:

<!ATTLIST DATE  ASSIGNED ( WRITTEN | RECEIVED ) RECEIVED>

The ASSIGNED attribute here can be used to signal whether the date recorded in the instance is the date on which the document was written or the date on which it was received. An instance could thus be produced as follows:

<LOCALDOC>
<AUTHOR>Monty Burns - Manager</AUTHOR>
<TITLE>Funding of Springfield Advertising Campaign</TITLE>
<DATE ASSIGNED="RECEIVED">July 20, 2003</DATE>
</LOCALDOC>
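The effect of the declared default can be sketched in Python. This fragment assumes the instance uses full start and end tags, so that Python's XML parser can read it; SGML in general permits markup minimisation that would need a real SGML parser. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ASSIGNED_DEFAULT = "RECEIVED"  # default value declared in the ATTLIST above

def date_assignment(instance_text):
    """Return (ASSIGNED value, date text), or None if there is no DATE."""
    root = ET.fromstring(instance_text)
    date = root.find("DATE")
    if date is None:
        return None
    # An absent attribute takes the default, as a validating parser would report.
    return date.get("ASSIGNED", ASSIGNED_DEFAULT), date.text
```

An instance whose DATE element carries no ASSIGNED attribute is therefore treated as recording the date of receipt.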

2.3 The WWW as a user interface

The presentation of documents to distributed clients has been successful with the World Wide Web. The documents are marked up in a non-proprietary way and are viewable by any number of software packages. Client software packages (Web browsers) also allow users to interact with documents in a relatively consistent way. A document tracking system which is targeted at distributed agencies can use this consistency and platform independence to provide for both data input and retrieval.

By completing a Web form, any user can input meta data for a document at their local machine. This data is delivered to the Web server by the browser and can be collected by a program running on the server using the Common Gateway Interface (CGI). The instances produced can be stored on the Web server (or written to some other machine) where they become available for searching and retrieval.
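Under CGI, the body of a form submitted by POST arrives on the program's standard input as URL-encoded name/value pairs. The following Python sketch shows how such a body might be turned into a LOCALDOC instance. It is not the DTS input program; the function name, the fixed element order and the escaping rule are assumptions made for illustration.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

ELEMENT_ORDER = ["AUTHOR", "TITLE", "DATE"]  # order taken from the example DTD

def form_to_instance(body):
    """Build a LOCALDOC instance from a URL-encoded form submission."""
    fields = parse_qs(body)
    lines = ["<LOCALDOC>"]
    for name in ELEMENT_ORDER:
        # A repeated field (for example two AUTHOR inputs) yields two elements.
        for value in fields.get(name, []):
            # Escaping assumes the DTD declares the amp, lt and gt entities
            # (the example DTD does not); adapt this to the DTD in use.
            lines.append(f"<{name}>{escape(value)}</{name}>")
    lines.append("</LOCALDOC>")
    return "\n".join(lines)
```

A real input program would also validate the result against the DTD before storing it.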

The document tracking system addressed in this paper uses this technique. It combines into a working application the benefits of computerised filing, SGML specification for documents, the World Wide Web user interface, the CGI mechanism and some readily available indexing tools.

The document tracking system (DTS) input component is managed by a CGI program which can read and interpret an SGML DTD. Each organisation can create a DTD to define the information they wish to record about their documents and make this DTD available to the DTS input program in a simple configuration file. They then create a web input form to match the elements of the DTD, using straightforward HTML markup. Thus, customisation of the document tracking system is possible without modification to the CGI program itself. Similarly, the retrieval of documents can also be freely configured to match any customised DTD. The specifics of the configuration methods are covered in the DTS documentation.

3 DTS input and list operations

The material given here is not a technical description of the inner workings of the DTS but rather a guide to the fundamental operations it performs. A detailed description is given in the DTS package documentation.

3.1 Development of DTD

An organisation either creates its own specification for the meta data for documents or uses a working specification provided with the DTS package. In either case, the specification of meta data results in an SGML DTD which must be included in the DTS configuration file. A DTD for use with this document tracking system must include a SEQ (sequence number) element and a DATE (time and date stamp) element. It is assumed that, if a custom DTD is to be used, it will be checked for validity by an SGML parser (for example, Omnimark).
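The requirement for SEQ and DATE elements lends itself to a simple pre-check before a custom DTD is installed. The Python sketch below is a superficial text scan under stated assumptions: it ignores comments, parameter entities and marked sections, and it does not replace validation by an SGML parser. The function names are hypothetical.

```python
import re

REQUIRED_ELEMENTS = {"SEQ", "DATE"}  # required by the DTS, per the text above

# Matches the name or name group of an element declaration, for example
# "<!ELEMENT SEQ" or "<!ELEMENT (AUTHOR, TITLE, DATE )".
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\([^)]*\)|[A-Za-z][\w.-]*)", re.IGNORECASE)

def declared_elements(dtd_text):
    """Return the set of element names declared in the DTD text."""
    names = set()
    for match in ELEMENT_DECL.finditer(dtd_text):
        group = match.group(1).strip("()")
        names.update(n.upper() for n in re.split(r"[\s,|&]+", group) if n)
    return names

def missing_required(dtd_text):
    """Return the required DTS elements that the DTD does not declare."""
    return REQUIRED_ELEMENTS - declared_elements(dtd_text)
```

Applied to the LOCALDOC example given earlier, this check would report that SEQ is missing.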

3.2 Creating a web form - naming input fields

A web form for data input is created in which the names of the input fields correspond to the elements and attributes specified in the DTD. There is no restriction on the location of the Web form; it can be placed anywhere under the document root of the web server software.

Using a Web form for input provides some benefits for distributed organisations, but it also brings some limitations. The naming conventions in Web forms mean that the instance must be reasonably 'flat' in structure since there is a simple mapping between input field names and the elements/attributes. Methods to overcome this limitation are currently being developed and later versions of DTS will provide for more flexibility in the content models for any element in the DTD.

3.3 User interaction - data input

To input data about a document, a user completes the Web form and submits it. The DTSINPUT program then:

3.4 Listing instances

The DTSLIST program provides a simple browsing and viewing capability to the DTS.

On an ordinary (initial) call to DTSLIST, the user is presented with a sequential list of stored documents showing the sequence number, summary information and date and time stamp. This list is rendered with each document sequence number as a hypertext link.

Linking through any sequence number calls DTSLIST with the selected sequence number as an argument and produces a display of the required document rendered as a simple textual outline.
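The list view described above can be sketched as follows. This is an illustration only, not the DTSLIST program: the `seq` query parameter, the `dtslist` script URL and the record layout are all assumptions.

```python
from html import escape

def render_list(records, script_url="dtslist"):
    """Render (seq, summary, stamp) records as HTML with linked numbers."""
    items = []
    for seq, summary, stamp in records:
        # Each sequence number links back to the script with itself as argument.
        link = f'<a href="{script_url}?seq={seq}">{seq}</a>'
        items.append(f"<li>{link} {escape(summary)} ({escape(stamp)})</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```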

4 Analysis, indexing and retrieval of documents

Storing documents is only of use if there is some way to retrieve them. As has been stated above, traditional filing systems either use subjective and often inaccurate classification data (such as might be used by an individual), or require expensive infrastructure and trained personnel to administer them (as is the case with a library catalog).

The use of SGML as described above allows descriptors for a document to be identified: to complete the package, a means of searching those descriptors and retrieving documents is needed.

Just as the Web provides a means for document entry, there is a range of Web tools available for indexing and retrieval; Webster and Paul provide an extensive 'Webliography' containing information and links to many of these.

The following criteria were deemed to be important in choosing between them:

Harvest, produced by the University of Colorado, was chosen as it met the following criteria:

Harvest can be viewed as comprising four related subsystems. They are:

4.1 SOIF

Central to Harvest's operation is the notion of a Summary Object Interchange Format (SOIF) which is used to store summary information from each document descriptor as a series of entries having the form:

attribute_name{length}: attribute_value

Thus, for the simple SGML document descriptor given above, and assuming that each SGML tag becomes a SOIF attribute of the same name, the corresponding SOIF object is:

author{21}: Monty Burns - Manager
title{43}: Funding of Springfield Advertising Campaign
date{13}: July 20, 2003
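The entries above can be produced mechanically from name/value pairs. The Python sketch below follows the entry form shown in this section; counting the length in bytes of a UTF-8 encoding is an assumption made here, as is the function name.

```python
def soif_lines(attributes):
    """Format (name, value) pairs as 'name{length}: value' lines."""
    lines = []
    for name, value in attributes:
        # Length is counted in bytes of the encoded value (assumed UTF-8).
        length = len(value.encode("utf-8"))
        lines.append(f"{name}{{{length}}}: {value}")
    return lines
```

Recording the length explicitly means a reader of the SOIF object does not need to scan for a terminator within the value.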

5 Current DTS updates and trialing

5.1 Updates to DTS input and list

The DTS, as described in this paper, has been developed and trialed in the School of Information Technology at Charles Sturt University in Bathurst. The processing speed of the programs has been good and, within the limitations of the Web form interface, the programs provide a valuable and accurate tracking system for the School.

The DTS system is currently being extended in a number of areas. One of these is being trialed across three of the campuses of Charles Sturt University (Albury, Wagga Wagga and Bathurst) as distributed agencies with common document tracking.

Another extension concerns SGML software tools. A structured SGML outliner/editor package is currently being developed by one of the authors (Errol Chopping) and as this comes to a working stage, a number of its features will be employed in the DTS package.

5.2 Updates to analysis, indexing and retrieval

A number of extensions to the basic Harvest software are proposed. Currently the gathering, summarising and indexing subsystems are activated once per day, but future plans are to have them performed each time a document summary is stored. This would have the advantage of making documents immediately retrievable, rather than the current delay of up to 24 hours.

Other extensions involve the user interface subsystem. The current document retrieval form is generic, having been designed for unstructured documents. It is proposed to modify the form and associated cgi scripts to reflect the known structure of our standard DTD (for example, allowing users to select element names from a 'drop down' list).

Further extensions will allow the system to incorporate new document types. Currently the user must carry out a number of manual tasks to incorporate a new document type into the system: the gatherer must be told where to find the file type, and the summariser must be given the DTD and told how to identify documents of that type and which elements to extract from them. An appropriate retrieval form and associated cgi script must then be developed. It is hoped that many of these tasks can be automated (perhaps dynamically from the supplied DTD). This final stage awaits the widespread adoption of a facility such as Java to allow forms to be modified as data is entered into them.

6 Appendix - the Harvest Information Discovery and Access System

6.1 Gatherer subsystem

The gatherer is responsible for retrieving document descriptors from target sites.

This subsystem is highly customisable. The user is able to specify the access methods used to retrieve files (for example, FTP, Gopher, HTTP, etc.) and the locations which will be searched (either as directory names or as URLs). The user is also able to specify which file types are to be included in the search, by reference either to the file's name or to its contents.

In the case of the Document Tracking system, we used the file name as our determinant, specifying that only files having a suffix of '.dts' were to be considered.
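Selection by file name suffix can be sketched in a few lines of Python. This is not Harvest's gatherer configuration, only an illustration of the rule for a local directory tree; the function name is hypothetical.

```python
from pathlib import Path

def find_descriptors(root, suffix=".dts"):
    """Return sorted paths under root whose file name ends with suffix."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.suffix == suffix
    )
```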

6.2 Summariser subsystem

The summariser is responsible for analysing the retrieved document descriptors to extract the required fields and converting them to SOIF objects. The gatherer has sets of default actions for a wide range of file types. For example, for text files, all words in the first 100 lines and then the first sentence of each paragraph are extracted, while for object files the symbol table is extracted.

This subsystem is also configurable but, for our application, the aspect of particular interest is the handling of SGML files. The user must supply three things: information to allow the file type to be determined (in our case, the suffix '.dts', as described above); the DTD; and a mapping table that specifies which of the SGML tags in the files are to be extracted to SOIF and the attribute name by which each is to be known. Files are parsed according to their DTD, with required SGML tags being mapped to SOIF attributes.

In the case of the Document Tracking System, we have chosen to map each SGML tag to a SOIF attribute of the same name. In addition to the user-specified attributes, the summariser adds a range of other attributes such as file name, file size and type.
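The mapping step can be sketched in Python as a table lookup over the elements of an instance. This is not the Harvest summariser: the table, the `file-name` and `file-size` attribute names and the use of Python's XML parser (which assumes an instance with full start and end tags) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping table: SGML tag -> SOIF attribute of the same name.
TAG_TO_ATTRIBUTE = {"AUTHOR": "author", "TITLE": "title", "DATE": "date"}

def summarise(instance_text, file_name):
    """Return (attribute, value) pairs for mapped tags plus file details."""
    root = ET.fromstring(instance_text)
    pairs = []
    for child in root:
        attribute = TAG_TO_ATTRIBUTE.get(child.tag)
        if attribute is not None:  # tags not in the table are skipped
            pairs.append((attribute, (child.text or "").strip()))
    # Attributes added by the summariser itself (names assumed here).
    pairs.append(("file-name", file_name))
    pairs.append(("file-size", str(len(instance_text.encode("utf-8")))))
    return pairs
```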

6.3 Indexing subsystem

The indexer is responsible for two tasks - adding new SOIF objects to an index, and retrieving object records according to query specifications.

The standard Harvest distribution uses the Glimpse indexing engine although others, such as WAIS, can be used if required, and it is even possible to substitute a relational database. We opted to use Glimpse since it was free, it was more flexible than WAIS (allowing matching on or off word boundaries and allowing for misspellings) and, since it was part of the standard distribution, less effort was required on the part of implementors.
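To illustrate what tolerance of misspellings means in practice, the Python sketch below matches words within a bounded edit distance of the query. It is a textbook dynamic-programming illustration and makes no claim about how Glimpse itself is implemented; the function names and the default of one permitted error are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def approx_find(query, text, max_errors=1):
    """Return words in text within max_errors edits of query."""
    q = query.lower()
    return [w for w in text.split() if edit_distance(q, w.lower()) <= max_errors]
```

With one error permitted, a query for 'springfeld' would still find a title containing 'Springfield'.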

Glimpse is also configurable (with the major considerations being the size of the index being created, and the percentage of "common" words to be left out of the index), although in our system we have not needed to change the default values.

6.4 User interface

As mentioned in the previous section, the indexing subsystem is responsible for answering queries. Raw Glimpse (despite its power) is not an acceptable user interface, nor are other common indexing packages.

To overcome this difficulty, the Harvest distribution includes a Web form and associated cgi scripts. These allow the user to enter the search criteria (either as an attribute name and text to be sought in that attribute - for example title : springfield - or as text to be sought anywhere in the stored document), which is passed to the script for validity checking. Valid requests are passed to the indexing subsystem which returns data from the SOIF object matching the request. The script then formats this data and displays it to the user, including URL links to each SOIF object.
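The validity check on search criteria can be sketched as follows. This is not the Harvest cgi script; the set of known attribute names and the function name are assumptions, and the sketch treats any query containing a colon as an attribute query.

```python
KNOWN_ATTRIBUTES = {"author", "title", "date"}  # assumed set, from the example

def parse_query(query):
    """Split a query into (attribute or None, search text)."""
    text = query.strip()
    if not text:
        raise ValueError("empty query")
    name, sep, rest = text.partition(":")
    if sep:
        # Attribute query, for example "title : springfield".
        attribute = name.strip().lower()
        if attribute in KNOWN_ATTRIBUTES and rest.strip():
            return attribute, rest.strip()
        raise ValueError(f"invalid attribute query: {query!r}")
    # No attribute given: search anywhere in the stored document.
    return None, text
```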

References

1
Spivak, J. (1996) The SGML Primer, ITP, Sydney.

2
Allette Systems (Australia) (1996) Exoterica's Comprehensive SGML Course, Course Notes, Allette Systems, Sydney, Australia.

3
Omnimark (1996), Exoterica Inc, Canada.

4
Sperberg-McQueen, C.M. & Burnard, Lou (1996) A Gentle Introduction to SGML, Electronic Text Encoding and Interchange (TEI P3)
http://info.ox.ac.uk/~archive/teip3sg/

5
DeRose, S. (1995) Standards Update SIGLINK Newsletter March, 4:(1):12-18.

6
Webster, K. and Paul, K. (1996) Beyond Surfing: Tools and Techniques for Searching the Web Information Technology Column, CLA Emerging Technologies Interest Group.
http://magi.com/~mmelick/it96jan.htm

7
Hardy, D. R., Schwartz, M. F. & Wessels, D. (1996) The Harvest Information Discovery and Access System.
http://harvest.cs.colorado.edu/

8
Glimpse (1994) The Glimpse index/search subsystem.
http://glimpse.cs.arizona.edu.:1994/glimpsehelp.html

9
Fullton, J., Gamiel, K. & Warnock, A. (1995) The Wide Area Information Server system.
ftp://ftp.cnidr.org/pub/NIDR.tools/freewais


Organised by: AUUG'96 & CSU