Data Management API: the standard and implementation experiences

Alex Miroshnichenko
Sr. Member of Technical Staff,
Veritas Software Corp.,
1600 Plymouth Street,
Mountain View, California, 94043, USA
Phone (415)335-8480
Fax (415)335-8050
Email: alex@veritas

Abstract

Today's computing environments are characterised by the ever-increasing demand for data storage capacity. Large amounts of data are stored on UNIX-based server computers and the costs associated with managing the storage subsystems have been significantly higher than the cost of the storage itself. It has been estimated that for each dollar spent in purchasing on-line disk storage capacity, about seven dollars will be spent managing this storage each year. There is an ongoing need for intelligent and efficient storage management.

Over the years, a wide variety of data management applications has been developed, including hierarchical storage management (data migration) applications, unattended backup and data recovery, and various on-line data compression schemes. We can also include in this category various enhanced data security applications (automatic file data encryption). Later in this paper, such data management applications are referred to as DM applications.

1 Introduction

All these applications work on the assumption that the working set of actively accessed data at any given moment is significantly smaller than the total amount of data available. These applications migrate the data between fast on-line storage of limited capacity and the tertiary storage archive. At the same time, they provide the on-line semantics for all the data on the tertiary archive; that is, the users do not perform any administrative operations in order to access the data; the application recognises the access to the archived data and automatically brings the data to the user.

Data management application vendors who attempt to implement such storage management applications in UNIX-compatible environments often find themselves in a difficult situation. In order to implement the functionality required by an application, some monitoring facilities must be provided by the operating system kernel. For instance, a data management application may need to be notified when a user attempts to read a block of data from a file. Also, some interfaces beyond the standard POSIX system calls are often required; for instance, the capability to read the data from a file without modifying the file access time.

To resolve these and other problems, data management application vendors often implemented their own modifications to the operating system kernel. There are several fundamental problems associated with this approach:

2 DMAPI

The Data Management Interfaces Group (DMIG) is a group of operating system, data management software and data management application vendors that has been working since 1993 to develop a set of Application Programming Interfaces (DMAPI) that would allow data management applications to be developed much like ordinary software applications, without the need for third-party application vendors to modify the OS kernel. The group has included representatives from Silicon Graphics, Sun Microsystems, Hewlett-Packard, Veritas Software, EMC Corporation, EMASS, Legato, IBM, Netstore, Cheyenne, NASA-Ames, Hitachi, Legent and Transarc. This is not a full list of the companies that have participated in the work of DMIG; given the ongoing consolidation of the computer and software industry, it is hard to keep this list up-to-date.

By the fall of 1995, DMIG had produced a Data Management API (DMAPI) specification. It defines interfaces to be implemented by the operating system vendors and used by the data management application vendors.

DMAPI benefits operating system vendors, third-party application vendors and users alike by providing a consistent, platform-independent interface for development of data management applications. Application vendors now need to support only one interface for all platforms; the interface does not change from one release to another and it is the same across all supported platforms. The data management application may be developed independently from the OS release cycles. File system and operating system vendors need to implement only one interface. End users can ask for 'DMAPI-compliant' applications, as this gives them a better understanding of the application functionality and certain guarantees about the quality of the data management solution because the kernel component comes from the operating system vendor.

DMAPI is targeted to provide an environment which is suitable for implementing robust, commercial-grade data management applications; it includes facilities for data management application crash recovery and stateful control of the file system objects.

3 Interface overview

This section presents a basic overview of DMAPI concepts with the emphasis on non-obvious issues, primarily based upon the author's experience as the interface implementor. One of the deceiving features of DMAPI is its apparent simplicity; the interface specification introduces a few dozen new library calls and a few data structures. Because it is an application programming interface, users of the DMAPI often tend to overlook the fact that a DM application itself is an extension of the operating system storage management subsystem and therefore should be designed and implemented with the care and attention usually given to kernel code. DMAPI simplifies the design of portable robust DM applications; it does not eliminate the need to have a design in the first place.

Further discussion often mentions specific DMAPI functions. It would be useful for the reader to have a DMAPI specification available for reference; the information on obtaining the specification document is given at the end of the paper.

3.1 Handles

DMAPI is a handle-based interface: all file system objects are represented by handles. A handle is an opaque persistent identifier which is unique per host. Handles are represented as variable length byte streams. The existence of a valid handle does not guarantee the existence of the object referenced by it. Note that a similar situation exists in NFS, where a client application may receive a 'stale handle' error if the object referred to by the NFS handle was removed on the server. The only way a DM application may guarantee the existence of the object referred to by a handle is to obtain either a right or a hold on the object. Besides regular file system object handles, DMAPI assigns a generic handle to each mounted file system. There is also one global handle per host. Different handles may refer to the same object; the only way to find this out is to compare them using the dm_handle_cmp() call.

DMAPI provides several ways to link handles with the file system name space. The simplest example is a dm_path_to_handle() call; given a pathname, it returns the handle for the object - it is worth mentioning that it does not follow symbolic links. Certain DMAPI event types (see below) deliver both object handle and object pathname component. There is also a dm_handle_to_path() call; given a parent directory handle and an object handle, it attempts to reconstruct a full path name to the object. Obviously, such path names may not be unique and may not be valid as the name space changes during the execution of this call.

As has been mentioned earlier, handles are opaque; a DM application is not supposed to interpret the contents of the byte streams representing handles. However, after some debate, a number of so-called legacy handle interfaces have been introduced. These interfaces are intended for easier integration of some old DM applications. Such applications often use file inode numbers as object handle equivalents and file system device numbers as generic handles. At the request of several DM application vendors, DMAPI included functions to construct handles out of a file inode number, generation count and file system device number. There are also functions to 'extract' these components from DMAPI handles. These legacy interfaces are optional; their usage is strongly discouraged, and they are not guaranteed to be compatible with their future versions (if any).

3.2 DMAPI events and sessions

DMAPI events are the mechanism that allows a DM application to be notified whenever certain operations occur in the operating system kernel. DMAPI events are very similar to watchdogs; however, the DMAPI specification attempts to define very carefully the event classes and types, the format of the data delivered by the events (event messages), as well as the conditions which cause event generation.

DMAPI supports synchronous and asynchronous events. When a synchronous event is generated, the user process is suspended in the kernel; it remains suspended until a DM application issues an explicit response to the event. The response includes a code indicating the desired action for the user process: continue or abort. Asynchronous events are for notification purposes only; they may indicate the completion (or failure) of certain operations. They do not require a response and do not block other processes.

There are five classes of DMAPI events:

A careful reader has probably noticed by now that in order to build a working real-life DM application, some kind of event delivery control mechanism is required. For instance, it is unlikely that every DM application will be interested in receiving all possible event types. Excessive generation of synchronous events can take a heavy toll on system performance. DMAPI introduces sessions as the primary communication channels between a DM application and the kernel component of DMAPI. Almost all DMAPI calls require a session argument. A DM application creates a session by calling dm_open_session(). Once a new session is created, the DM application may register event dispositions for this session by calling dm_set_disp(). Dispositions indicate which event types should be delivered to this session. Another interface, dm_set_eventlist(), controls event generation; it allows certain event types to be enabled on a per-object basis. For instance, one file system may have all namespace events enabled, while another file system has only 'after' events enabled. Data events are not controlled by dm_set_eventlist(). Once the session is created, dispositions are set and the events are enabled, the DM application is ready to receive events for this session. The DMAPI call dm_get_events() is used for this purpose. The call will block until an event is queued on the session; an application may specify a non-blocking mode, in which the call returns immediately if no events are queued on the session. Multiple events may be returned by a single call.
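The interplay between dispositions (where an event is delivered) and event lists (whether an event is generated at all) can be illustrated with a small toy model. The sketch below is written in Python purely for brevity and is not the DMAPI C interface; the class names, the string event types and the per-file-system keys are all assumptions made for illustration. In this model an event with no registered disposition is simply dropped, which is only one of the behaviours an implementation might choose.

```python
from collections import deque


class Session:
    """Toy stand-in for a DMAPI session: a queue of delivered events."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def get_events(self, max_events):
        """Non-blocking fetch of up to max_events queued events."""
        out = []
        while self.queue and len(out) < max_events:
            out.append(self.queue.popleft())
        return out


class ToyKernel:
    """Hypothetical kernel component; not a real DMAPI implementation."""

    def __init__(self):
        self.dispositions = {}  # (file system, event type) -> Session
        self.eventlists = {}    # file system -> set of enabled event types

    def set_disp(self, session, fs, event_types):
        for ev in event_types:
            self.dispositions[(fs, ev)] = session

    def set_eventlist(self, fs, event_types):
        self.eventlists[fs] = set(event_types)

    def generate(self, fs, event_type, detail):
        """Return True if the event was queued on some session."""
        if event_type not in self.eventlists.get(fs, set()):
            return False  # generation is not enabled on this object
        session = self.dispositions.get((fs, event_type))
        if session is None:
            return False  # enabled, but no disposition: dropped in this toy model
        session.queue.append((event_type, detail))
        return True


kernel = ToyKernel()
sid = Session("hsm")
kernel.set_disp(sid, "/fs1", ["create", "postremove"])
kernel.set_eventlist("/fs1", ["create", "postremove"])
kernel.set_eventlist("/fs2", ["postremove"])

queued = [
    kernel.generate("/fs1", "create", "a.txt"),
    kernel.generate("/fs1", "rename", "a.txt"),      # not in the event list
    kernel.generate("/fs2", "postremove", "b.txt"),  # enabled, no disposition
    kernel.generate("/fs1", "postremove", "a.txt"),
]
events = sid.get_events(max_events=8)
```

Only the two events that are both enabled and have a disposition reach the session, and a single get_events() call returns both of them.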

An interesting question is what happens if event generation is enabled, either explicitly by setting an event list bit or implicitly by the existence of managed regions, but no session has registered a disposition for the event. The DMAPI specification leaves it to the particular implementation to define the behaviour. For example, an implementation may choose to ignore event generation and proceed with the user process execution for all event types except data events. Ignoring data events in this case may potentially present data integrity problems, so an error is returned to the user process.

Sessions and event lists allow a DM application to exercise fine control over event delivery and generation. Applications may direct different event types to different sessions and avoid session overloading. From the author's experience, it generally makes sense to direct synchronous and asynchronous events to different sessions for system performance reasons: asynchronous events may be generated at a very high rate because they do not block any processes (imagine how many postremove events you will get as a result of a recursive directory tree removal!); if they are delivered to the same session as synchronous data events, then it is likely that the DM application will have to retrieve (and potentially process) a large number of asynchronous events before it gets around to the data event. But it is the data event that keeps a user process blocked. Design decisions like this may directly affect system response time and performance in general.
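The queueing effect described above can be made concrete with a few lines of Python. This is only a sketch under simple assumptions: sessions are modelled as plain FIFO queues, event types are strings, and the burst size of 1000 is an arbitrary illustrative figure.

```python
from collections import deque


def events_before_data_event(queue):
    """Count how many events must be dequeued before the first 'read' event."""
    for position, (event_type, _name) in enumerate(queue):
        if event_type == "read":
            return position
    return None


# A recursive directory removal produces a burst of asynchronous events,
# followed by one synchronous read event that keeps a user process blocked.
burst = [("postremove", f"file{i}") for i in range(1000)]
read_event = ("read", "report.dat")

# Design 1: everything is delivered to one session.
shared_session = deque(burst + [read_event])

# Design 2: asynchronous and synchronous events go to separate sessions.
async_session = deque(burst)
sync_session = deque([read_event])

shared_delay = events_before_data_event(shared_session)
split_delay = events_before_data_event(sync_session)
```

With one shared session the blocked user process waits behind the whole burst; with a dedicated session for synchronous events the read event is at the head of its queue.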

3.3 Event Tokens and Access Rights

A token is a reference to the kernel state associated with a synchronous event. An outstanding synchronous event means that there is a suspended user process waiting for the response to the event. Such a process may hold certain system resources - for example, internal operating system locks. DMAPI tokens are a mechanism which abstracts the knowledge of the kernel state details and presents it to data management applications in a portable and consistent fashion. In the interface, tokens are represented as opaque scalars; they are created by the kernel component when a synchronous event is delivered to a session and they stay valid for as long as the corresponding event remains outstanding. A successful response to an event destroys the event token.

As tokens refer to a state maintained by the kernel, they are not tied to any particular system process context. A DM application which has received a token as a result of a dm_get_events() call is free to pass it to other threads and application processes for further use.

DM applications use the tokens to obtain access rights to file system objects to guarantee stability of the objects. Access rights may be shared or exclusive. A shared right allows read-only access; an exclusive right allows modification of the object. The access rights provide a portable abstraction of the internal operating system locks. Note that such locks are different from mandatory file and record locking, which is tied to processes and file descriptors. Access rights are tied to DMAPI tokens; a response to an outstanding event destroys its token and releases all access rights that might have been associated with it.
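The relationship between tokens and access rights can be sketched with a toy model. The Python below is an illustration only: the class and method names are hypothetical, rights are plain strings, a conflicting request simply returns False instead of blocking, and right upgrades are not modelled.

```python
import itertools


class ToyTokenTable:
    """Kernel-side state for outstanding synchronous events (toy model)."""

    def __init__(self):
        self._next = itertools.count(1)
        self.tokens = {}  # token -> set of objects on which it holds a right
        self.rights = {}  # object -> (right, set of tokens holding it)

    def deliver_sync_event(self):
        """A synchronous event is delivered: create and return its token."""
        token = next(self._next)
        self.tokens[token] = set()
        return token

    def request_right(self, token, obj, right):
        """right is 'shared' or 'exclusive'; return True if granted."""
        if token not in self.tokens:
            raise KeyError("invalid token")
        held = self.rights.get(obj)
        if held is not None:
            held_right, holders = held
            if right == "exclusive" or held_right == "exclusive":
                return False  # conflict; this toy model does not block
            holders.add(token)
        else:
            self.rights[obj] = (right, {token})
        self.tokens[token].add(obj)
        return True

    def respond(self, token):
        """Responding to the event destroys the token and drops its rights."""
        for obj in self.tokens.pop(token):
            _right, holders = self.rights[obj]
            holders.discard(token)
            if not holders:
                del self.rights[obj]


table = ToyTokenTable()
t1 = table.deliver_sync_event()
t2 = table.deliver_sync_event()

got_shared_1 = table.request_right(t1, "fileA", "shared")
got_shared_2 = table.request_right(t2, "fileA", "shared")
got_excl = table.request_right(t2, "fileB", "exclusive")
blocked = table.request_right(t1, "fileB", "shared")

table.respond(t2)  # destroys t2 and releases everything held under it
after_release = table.request_right(t1, "fileB", "exclusive")
```

Because the table lives on the "kernel" side and is keyed only by the token, nothing in the model ties a token to the process that first received it.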

Event tokens are arguably the most important concept in DMAPI, although not an entirely new one. They enable the design of recoverable, robust data management applications. A DM application which has crashed, for whatever reason, can recover its state from the kernel state upon restart and continue its work.

3.4 File Data Control, Managed Regions and Data Residency

DMAPI provides several ways to control user access to file data. Among those are managed regions, the ability to do invisible data I/O and an interface to determine and directly control local disk allocation for the file data. The terminology used in the following discussion is defined as follows:

Please note that the DMAPI data flow model does not allow access to the data if it is not copied to the local disk first.

As has been mentioned above, data event generation is not controlled by event lists but by managed regions. Managed regions are the mechanisms for an application to control file data access at a granularity level less than file size. A managed region is an extent in the logical file space; it is described by its starting offset, length and event generation flags. For example, a managed region with offset=0, length=8192, flags=READ|WRITE|TRUNCATE, means that any attempt to access file data between offset 0 and 8192 will trigger a data event. Actual event type depends on the type of access: for example, a write attempt generates a write event. Managed regions may not overlap, but they do not have to be contiguous. The geometry of managed regions does not have any relationship to the real geometry of the file; for instance, managed regions may cover logical area beyond the end of the file. A managed region may have none of its flags set; it is equivalent to not having managed region at all. DMAPI provides interfaces to set managed regions on a file and to obtain the currently set managed regions.
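The rule for when an access triggers a data event can be written down in a few lines. The following Python sketch is a toy model, not the specification's data structures: the flag values, the ManagedRegion class and the choice of half-open byte ranges [offset, offset + length) are assumptions made for illustration, and truncation is left out of the example for simplicity.

```python
from dataclasses import dataclass

# Illustrative flag values; not the constants defined by the specification.
READ, WRITE, TRUNCATE = 1, 2, 4
EVENT_NAMES = {READ: "read", WRITE: "write", TRUNCATE: "truncate"}


@dataclass(frozen=True)
class ManagedRegion:
    offset: int
    length: int
    flags: int


def overlaps(region, offset, length):
    """True if [offset, offset+length) intersects the region's byte range."""
    return (offset < region.offset + region.length
            and region.offset < offset + length)


def data_event_for(regions, access, offset, length):
    """Return the name of the data event an access would generate, or None."""
    for region in regions:
        if region.flags & access and overlaps(region, offset, length):
            return EVENT_NAMES[access]
    return None


regions = [
    ManagedRegion(offset=0, length=8192, flags=READ | WRITE | TRUNCATE),
    ManagedRegion(offset=16384, length=4096, flags=WRITE),
    ManagedRegion(offset=32768, length=4096, flags=0),  # no flags: inert
]

inside_write = data_event_for(regions, WRITE, 100, 10)
gap_read = data_event_for(regions, READ, 8192, 100)
unflagged_read = data_event_for(regions, READ, 16384, 10)
straddling_write = data_event_for(regions, WRITE, 8000, 9000)
inert_write = data_event_for(regions, WRITE, 32768, 10)
```

Accesses that fall in the gap between regions, or that touch a region whose flags do not include the access type, generate no event in this model.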

Managed regions are typically used to protect files which have parts of their data staged out; that is, moved to the tertiary storage. In such cases, a DM application will set a managed region on the file; all event flags will be enabled and a disposition set for all data events. If a user process attempts to read the file data, the kernel component of DMAPI will generate a read event and suspend the user process. The DM application receives the read event, reads the file data from the tertiary storage, writes it to the file being accessed and then responds to the user event. The kernel unblocks the user process and it proceeds to read file data.

A careful reader will notice that the above data flow scenario is not possible as described: in order to bring the data from the tertiary storage to the file on the local storage, a DM application must be able to write to the file while managed regions are set without generating a write event. DMAPI provides special interfaces for accessing the data 'under cover', bypassing event generation code. These interfaces are often called invisible I/O. Their semantics are similar to regular read(2) and write(2) system calls, except they do not generate data events and they do not modify file timestamps.
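The complete read path for a staged-out file, including the invisible write, can be modelled as follows. This is a deliberately simplified Python sketch: the classes are hypothetical, the DM application's event handling is inlined into the read path rather than run as a separate process, the whole file is covered by a single managed region, and timestamps are a simple counter.

```python
class ToyFile:
    def __init__(self, size):
        self.data = bytearray(size)  # local disk contents; zeros while staged out
        self.atime = 0
        self.mtime = 0
        self.managed = True          # one managed region covering the whole file


class ToyDMAPI:
    """Hypothetical model of the kernel component plus one DM application."""

    def __init__(self, archive):
        self.archive = archive  # file name -> bytes held on "tertiary storage"
        self.events = []        # log of generated data events
        self.clock = 0

    def _tick(self):
        self.clock += 1
        return self.clock

    def invisible_write(self, f, offset, payload):
        """Write without generating a data event or touching timestamps."""
        f.data[offset:offset + len(payload)] = payload

    def user_read(self, name, f, offset, length):
        if f.managed:
            # The kernel generates a read event and suspends the user process.
            self.events.append(("read", name))
            # The DM application recalls the data, clears the region, responds.
            self.invisible_write(f, 0, self.archive[name])
            f.managed = False
        # The user process resumes; an ordinary read updates the access time.
        f.atime = self._tick()
        return bytes(f.data[offset:offset + length])


archive = {"report.dat": b"hello world"}
f = ToyFile(len(archive["report.dat"]))
dm = ToyDMAPI(archive)

first = dm.user_read("report.dat", f, 0, 5)
second = dm.user_read("report.dat", f, 6, 5)
```

The first read triggers exactly one event and a recall; the second read finds the data resident and proceeds without involving the DM application. The modification time is never changed, since the only write was invisible.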

DMAPI also defines interfaces which enable efficient control of the local disk storage allocation and moving the data from the local disk to the tertiary storage. The dm_get_allocinfo() call returns the logical map of the local disk allocation. The dm_punch_hole() call allows a DM application to 'punch' a hole in the file data allocation with the hope of freeing local disk storage. A DMAPI implementation may impose its own restrictions on the allowed geometry of the 'holes'.

This whole area of local disk storage allocation, holes and file data residency has been a source of considerable confusion to data management application designers. Sometimes they tend to regard local disk allocation as an indicator of data residency. Evidently, this is not the case, as a regular sparse file often cannot be distinguished from a file with non-resident portions of data. A DMAPI implementation does not know anything about file data residency; it only provides the mechanisms described above to maintain a coherent relationship between data residency and file allocation. It is the responsibility of the DM application to use these mechanisms in a proper manner.
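A short sketch may help to show why allocation cannot stand in for residency. In the Python below, alloc_map() is a hypothetical helper that produces a simplified picture of the kind of map an allocation query might return; the file names, extent lists and the residency record are all assumptions made for illustration.

```python
def alloc_map(extents, size):
    """Describe a file as ('data' | 'hole', offset, length) covering [0, size)."""
    out, pos = [], 0
    for off, length in sorted(extents):
        if off > pos:
            out.append(("hole", pos, off - pos))
        out.append(("data", off, length))
        pos = off + length
    if pos < size:
        out.append(("hole", pos, size - pos))
    return out


# A sparse file: the user never wrote anything to [4096, 8192).
sparse = alloc_map([(0, 4096), (8192, 4096)], size=12288)

# A migrated file: the DM application copied [4096, 8192) to the archive
# and then punched a hole over the same range.
migrated = alloc_map([(0, 4096), (8192, 4096)], size=12288)

same_allocation = sparse == migrated

# Only the DM application's own record tells the two files apart.
non_resident = {"sparse.dat": [], "migrated.dat": [(4096, 4096)]}
```

The two allocation maps are identical, so a DM application that inferred residency from allocation alone would either try to recall data that never existed or fail to recall data that did.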

4 Extended attributes in DMAPI

Extended attributes are arguably one of the most controversial issues in DMAPI. A managed region is an example of a data management attribute defined by the specification. Another example is the event generation bit mask set on an object. The term 'extended' means that these attributes are extensions to the standard file attributes as returned by the stat(2) system call.

Such attributes may be persistent or non-persistent. The existence of non-persistent attributes may be guaranteed only by acquiring an access right to the object. It is clear that in order for the managed regions to be of any use, some kind of persistence guarantee must be provided. The most straightforward way is to provide persistent storage for the attributes on the local disk. However, some operating system vendors were reluctant to commit to this as the only solution as it required them to change their local disk format. An optional debut event was introduced to support those so-called 'zero-bit' implementations. A debut event is generated every time an object (file, directory) becomes active, for example as a result of open(2) or stat(2) system calls. A DM application may then set non-persistent attributes for that object. Needless to say, such implementations are very slow and inefficient (imagine a synchronous event generated for every stat(2) call!). Besides, the application still needs to store these attributes persistently somewhere, usually in some 'look-aside' database. Maintaining a coherent relationship between such a database and the local file system in the presence of system crashes is a non-trivial task.

The DMAPI specification defines an interface for generalised opaque extended attributes. Unlike the specification-defined data management attributes (like managed regions), the format of opaque extended attributes is left to the DM application. Opaque extended attributes are always persistent; one can think of them as alternate data streams in a file system object. The major difference in semantics is the access method: opaque extended attributes can only be set and read as a whole - they cannot be read and written at an arbitrary offset in the stream. DMAPI attributes are named - an attribute name is an 8 byte sequence. It does not have to be an ASCII string, though in reality most of the names are.
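The access semantics of opaque attributes (fixed 8-byte names, values set and read only as a whole) can be captured in a small model. The Python class below is hypothetical and keeps the attributes in memory only; the attribute name, the zero-byte padding convention and the value format are assumptions chosen for the example.

```python
class OpaqueAttrs:
    """Toy per-file store for opaque attributes (in memory, not persistent)."""

    NAME_LEN = 8

    def __init__(self):
        self._attrs = {}

    def _check(self, name):
        if not isinstance(name, bytes) or len(name) != self.NAME_LEN:
            raise ValueError("attribute name must be exactly 8 bytes")
        return name

    def set(self, name, value):
        """Replace the whole value; there is no write-at-offset operation."""
        self._attrs[self._check(name)] = bytes(value)

    def get(self, name):
        """Return the whole value; there is no read-at-offset operation."""
        return self._attrs[self._check(name)]


attrs = OpaqueAttrs()
key = b"HSMLOC".ljust(8, b"\0")  # pad a shorter ASCII name to 8 bytes

attrs.set(key, b"tape=0042;blk=17")
attrs.set(key, b"tape=0043;blk=2")  # replaces the previous value entirely
location = attrs.get(key)

try:
    attrs.set(b"short", b"x")
    rejected = False
except ValueError:
    rejected = True
```

The value is opaque to the store: the 'tape=...;blk=...' layout is meaningful only to the application that wrote it.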

Typically, DM applications use opaque extended attributes to store file-specific data. Examples include pointers to the file data location on the tertiary archive, migration policy attributes and, in general, whatever an application may want to store persistently. Extended attributes essentially eliminate the need for 'look-aside' databases in DM applications. As has been pointed out earlier, 'look-aside' databases present a fundamental problem with respect to coherent crash recovery and state synchronisation.

Extended attributes are not a new concept introduced by DMAPI. For example, the UNIX International Stackable File System requirements document includes requirements for generalised extended attributes. Unfortunately, very few software vendors have actually implemented extended attributes in their file systems. Access Control Lists (ACL) are often cited as an example of extended attributes; however, their implementations are usually very specific and cannot be easily adapted for general purpose.

This reluctance on the part of the system vendors to commit to extended attribute support forced opaque extended attributes in DMAPI to become an optional feature of the interface. Also, some of the attribute interface issues - for example, attribute inheritance - have not been addressed properly. In the author's opinion, the optional character of the opaque extended attributes in the DMAPI specification has significantly reduced the value of the standard to the DM application developers, who may be forced to design to the lowest common denominator. On the other hand, it presented a good incentive for the DMAPI implementors to differentiate their products by providing full support for the opaque extended attributes.

Figure 2 illustrates the DMAPI model data flow for reading a nonresident file.

5 Veritas implementation of extended attributes in DMAPI

Veritas Software Corporation is the leading supplier of data storage management software. Its products, which include Veritas File System (VxFS), have been licensed by all major computer systems and system software vendors and widely distributed, often under different aliases. For example, Hewlett-Packard JFS and Online JFS products are in fact VxFS ports to the HPUX operating system. Veritas has been an active member of the DMIG since its inception.

Veritas has implemented the full DMAPI specification in its VxFS product, including opaque extended attributes. A consistent attribute implementation required a different approach than that assumed by the DMAPI specification. As has been mentioned before, attribute inheritance presents an interesting challenge to the designer of a generalised attribute implementation. Any attribute class may have its own inheritance rules, so the internal design must be flexible. Attribute operations must follow all VxFS data modification semantics; that is, be performed as, or as part of, file system transactions. Attribute operations should not noticeably reduce file system performance. DMAPI imposes the additional requirement that DM application developers should be able to write their own attribute handling code and implement their own attribute inheritance rules without source code modification to the base file system product and ideally without any help from Veritas. After careful consideration, it became clear that in order to satisfy those requirements, an attribute agent kernel interface must be introduced. This interface is documented, and sample agent code is provided to assist DM application developers. One can consider it a Veritas extension to DMAPI. On the architectures which support kernel loadable modules, the attribute agent is usually implemented as such.

The attribute agent includes up to 6 functions called attribute intervention routines:

The experience with the above design approach has been very positive. Some DM application vendors expressed initial reluctance to write a kernel module; however, after they carefully examined the reasons to do so and the deficiencies in the DMAPI approach to extended attributes, they changed their minds. The amount of kernel code that needs to be written is minimal, depending on how fancy the DM application designer wants to be. It may vary from a few dozen lines of C code to several pages. The code is procedurally simple; all it does is buffer manipulation and 'bit-shuffling'. There is no need to implement any kind of kernel locking or synchronisation.

6 Current Status

Several major vendors have implemented the DMAPI specification with various degrees of compliance. The author knows about implementations by Convex Computer (now a division of Hewlett-Packard) and by Silicon Graphics as a part of its XFS product.

As has been mentioned above, Veritas Software made DMAPI implementation a standard part of its file system product. Given the large OEM customer base, this effectively puts DMAPI on every major UNIX vendor platform.

A number of data management application vendors have deployed DMAPI-compatible applications. A major storage vendor, EMC Corporation, has released a new version of EpochServ for Solaris which is DMAPI-based.

The DMAPI specification was presented to the X/Open System Management Group in January 1996. At the time of writing (August 1996), X/Open has published what it calls a 'sanity check draft' intended for proofreading and minor syntax corrections. Full X/Open standard status is expected any day now. Please note that X/Open has changed the name of DMAPI to XDSM (X/Open Data Storage Management Specification).

7 Experiences and the Future

The working process at DMIG has had a lot of similarities to the designing of a new and complex software product. Both application and system vendors had to review the design of their respective products and often they discovered problems with their products they did not know existed.

Several fundamental problems do exist with the DMAPI specification as it stands. One of them, the extended attribute architecture, has been discussed above. It has also become clear that DMAPI-defined interfaces are not sufficient to implement robust high speed backup and restore systems. Distributed file system issues - OSF DCE integration in particular - have not been addressed at all. There is no security in the DMAPI interface other than an assumption that all DMAPI calls are made by processes with superuser credentials.

DMIG members have deliberately decided not to address these issues in the current revision of the specification. Some of those issues (like DCE integration) are not clearly defined in the first place. Another major factor was the common goal of agreeing on a workable common set of features which would allow development and deployment of viable commercial products, leaving unresolved issues to be addressed in the next revision of the specification. This goal has been achieved.

With the upcoming X/Open approval and a large number of applications and implementations released and deployed, it may be concluded that the DMIG has been a very successful example of multiple vendor cooperation working towards common goals. On the other hand, the whole hierarchical storage management market has always been 'about to really take off' but never actually did. Falling disk prices made the off-line storage cost benefits less attractive; however, the cost of storage management remains a major factor in favour of HSM solutions.

From the author's perspective, an important result of this DMAPI experience was development of the new software technology and a fresh outlook on data storage issues in general. The ideas and software modules developed during DMAPI implementation at Veritas have been successfully applied in other company products and projects.

8 Additional information

The most recent version of the DMAPI specification is version 2.3a. It is available from an anonymous ftp server at acsc.com. Questions and comments may be addressed to the mailing list dmig@epoch.com. To obtain X/Open XDSM document, please contact X/Open Ltd. directly at http://www.xopen.org.

Acknowledgements

The author would like to express special gratitude to his fellow file system engineers at Veritas Software Corporation - in particular to John Carmichael, whose experience and energy played a key role in defining the DMAPI architecture, and to Marianne Lent, whose meticulous attention to detail made the Veritas implementation of DMAPI possible. This paper has been sponsored by Veritas Software Corporation.

Author information

Alexander Miroshnichenko is a Senior Member of Technical Staff at Veritas Software Corporation in Mountain View, California where he has been working on advanced file systems for almost 4 years. He has been representing Veritas at the DMIG since early 1994. He received his Masters degree in Applied Physics from Moscow Institute of Physics and Technology and has been working on various aspects of UNIX storage management for the past 10 years. Alex can be reached at alex@veritas.com.

Bibliography

1. Webber, N. (1993) Operating System Support for Portable Filesystem Extensions in Proceedings USENIX 1993 Winter Conference, San Diego, CA.

2. Lawthers, P. (1995) The Data Management Application Programming Interface in Proceedings 14th IEEE Mass Storage Systems Symposium 1995, Monterey, CA, pp. 327-335.
URL: http://www.computer.org/../../../conferen/mss95/lawthers/lawthers.htm

3. Unix International Stackable File System Working Group (1993) Requirements for Stackable File Systems, Rev 3.6.

4. Bershad, B. & Pinkerton, C. (1988) Watchdogs: Extending the Unix File System in Proceedings USENIX 1988 Winter Conference, Dallas, TX.

5. Carmichael, J. & Shelat, R. (1996) A Replicated File Service in AUUG 96 & Asia Pacific World Wide Web 2nd Joint Conference Proceedings, Melbourne, Australia, pp. 60-72.


Organised by: AUUG'96 & CSU