Sun SITE Northern Europe - The Archive Continues...

Stuart McRobert
Department of Computing,
Imperial College, London, England
Email: sm@doc.ic.ac.uk

Abstract

This paper looks behind the scenes at many of the disk and file system issues involved in supporting and attempting to expand the size of a large and active archive site - Sun SITE Northern Europe. The present hardware configuration and support for various file systems are discussed, including experience of disk concatenation, level 5 RAID support both in software and hardware, and the need for UFS logging (trans) coupled with disk mirroring. Some software services are also mentioned along with brief ideas for improving archive network connectivity. Finally, significant expansion of the archive to cope with ever-increasing demands on resources is eagerly anticipated.

1 Introduction

Sun SITE Northern Europe is a hugely successful archive site providing access to a vast collection of information and a wide range of user services. This paper concentrates mainly on the storage technology underlying it, covering both the software and hardware used to support the archive's extensive file systems, and shares many of the experiences gained over the past few years.

2 Brief History

A long, long time ago, when the VAX reigned supreme and Internet connectivity was just a distant dream, a few of us started to save freely distributable software collected from distant universities in a common area. The idea was fairly simple: before embarking on the often lengthy and arduous task of attempting a complex download from some distant site (which might take a week or more to complete), one would first look to see if someone else had already obtained the software, or was already attempting to download it. This saved time, network bandwidth and needlessly duplicated effort. It also widened people's knowledge of what was available - browsing the archive started very early on.

Hence the idea for the archive was born - almost out of necessity, due to poor network connectivity (well at least some things never change) - and the number of local users soon grew considerably. However, it soon became clear that there were far more users than people willing to fetch files back. One person, Lee McLoughlin, in his 'spare' time, took on the task of maintaining the content of the archive, something that he does to this day. Meanwhile, I concentrated on the hardware side; hence the bias towards it in this paper. Help early on in getting the archive going in terms of new disk space and controllers came from the UK Unix User Group (UKUUG), along with many donations from the academic research community who provided numerous hand-me-down SMD disk drives (chiefly the trusty Fujitsu Eagle) as research groups moved over to new computer systems using SCSI disk drives instead. Many other UK universities heard about what we were doing and the popularity of the archive climbed with access over the UK academic X.25 network (JANET) of those days.

In the summer of 1993, Sun offered us the opportunity of being one of their first large Sun SITEs, replacing our seriously over-burdened Sun 4/360 with a powerful new Sun SS1000 and, most importantly, all-new fast SCSI disk drives. Sun Microsystems has kindly continued to upgrade the system ever since, so helping make Sun SITE Northern Europe one of the most successful archives in the world.

3 People

Special thanks should go to my colleague Lee McLoughlin for all the work he has done in his spare time (evenings/weekends) and the extensive range of software he has developed and released to help maintain the archive. Most notable is probably mirror, which automatically updates local copies of remote files from distant FTP sites and so helps to keep the archive right up-to-date (networks and available disk space permitting, of course). A few other specialists also help to maintain specific areas of the archive; thank you to them too.

4 Sun SITE Northern Europe

4.1 Present Day Configuration

In mid-August 1996, Sun SITE Northern Europe comprised the following hardware configuration:

This provides the sound hardware platform on which the archive now runs, but although it may sound a lot, it gets to be very busy indeed, especially during term time when all the students are accessing it too.

4.2 Disk Storage

Since disk space is the key to the underlying operation of any archive, it is worth reviewing how the deployment of disks has changed over the past few years. Disks supplied by Sun are usually housed in conventional Sun-style 19in. disk trays, each of which can mount up to six 5.25in. disk drives with either one or two SCSI buses per tray. Each SCSI bus is then connected to a Sun differential SCSI SBus controller, with differential SCSI being chosen simply because of the cable lengths involved.

The number of SCSI devices (for example, disk drives) per controller is always tricky to decide. Clearly, for maximum performance a one-to-one mapping would be ideal, although not very practical since the number of host expansion slots available to accept controllers tends to be limited - for example, an SS1000 has three slots per motherboard, and up to four motherboards per system (and some slots will be required for other uses such as networking). On the other hand, two to three drives per controller (half a Sun disk tray) is a reasonable compromise for general use and normally provides quite acceptable performance. However, as more disks arrive, even this density of drives per controller is hard to accommodate. Finally, with the introduction of RAID controllers supporting six or more of their own fast SCSI buses, six drives per RAID SCSI bus becomes a fairly popular drive density with perhaps a seventh drive as a warm/hot spare on the bus too. This density of drives actually works out quite well for RAID 5 use, since the I/O is fairly uniformly spread across all the disks and the transfers can be highly interleaved to make excellent use of bus bandwidth. The RAID controllers themselves are then connected back to the host via a dedicated fast wide differential SCSI controller or maybe even via fibre channel to maximize available bandwidth.
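The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch only: the slot counts come from the text, while the number of slots reserved for networking, the drive count and the function name are assumptions chosen purely for illustration.

```python
import math

def controllers_needed(drives: int, drives_per_controller: int) -> int:
    """Number of SCSI controllers required for a given drive density."""
    return math.ceil(drives / drives_per_controller)

# Figures from the text: an SS1000 has three slots per motherboard
# and up to four motherboards per system.
SLOTS_PER_BOARD = 3
MAX_BOARDS = 4
total_slots = SLOTS_PER_BOARD * MAX_BOARDS

# Assumption for illustration: two slots are kept for networking.
reserved_slots = 2
free_slots = total_slots - reserved_slots

# Hypothetical drive count: six full trays of six drives each.
drives = 36

for density in (1, 3, 6):
    need = controllers_needed(drives, density)
    verdict = "fits" if need <= free_slots else "does not fit"
    print(f"{density} drive(s) per controller: {need} controllers, {verdict}")
```

Under these assumed numbers, only the six-drives-per-bus density leaves room within a single host, which is consistent with the move towards RAID controllers that bring their own SCSI buses.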

4.3 Archive Storage Expansion

Initially, to meet the ever-increasing demand for more disk space on the archive, additional disk drives were quickly added and simply 'grown' onto the end of the existing file system via the Solaris meta disk growfs command. However, a number of issues started to arise:

Sun supplied additional disk space to initially support the migration to a level 5 software RAID subsystem, but it would not be alone. With Sun's support, a donation of some 200 GB of disk space from Quantum was made available out of a Terabyte of disk space kindly donated to support the Central Parks located at each of the Global Internet Exchange (GIX) points (of which the archive is one), for The Internet 1996 World Exposition. Now, in order to interface all these drives to Sun SITE, significantly more SCSI buses would be required.

4.3.1 SPARCstorage Array

Sun kindly provided us with a SPARCstorage Array model 200 to help with our ongoing requirement to support ever more disk space. The Sun SPARCstorage Array provides high performance and reliability, flexible configuration with Sun supplied drives, and ease of use. Some of its key features are:

4.3.2 Disk Arrays

This section attempts to quickly introduce some of the key information, issues and benefits of using disk arrays, along with details of the two levels of RAID that are currently used on the archive.

Disk arrays first became really popular with supercomputers because of the need for very high performance I/O along with vast storage requirements, often easily exceeding the capacity and performance of single specialist disk drives. However, by spreading the data across multiple high performance drives via specialised controllers, dramatic performance improvements could be achieved with transfer rates approaching the aggregate of the combined drives.

In 1987, researchers at Berkeley formally proposed the idea of using Redundant Arrays of Inexpensive Disks, or RAIDs, for much more general use. They wanted to achieve the performance and reliability of larger, much more expensive drives whilst using a number of much cheaper smaller units. The use of multiple drives also required a means of protecting the data from the loss of any single drive in the array, and helped raise MTBF values. As a consequence, RAIDs are considered fault tolerant and many users buy them for that reason alone, although there are much wider benefits, including briefly:

Although there are several different types of RAID, all systems achieve higher reliability and/or performance by duplicating or spreading data between multiple disk drives. Just how this is done can seriously affect performance, both in terms of read/write transfer rates (MB/s to or from the RAID) and I/O operations per second (IOPS), which measures how good the RAID is at handling multiple independent I/O requests. It is worth noting that a RAID with high transfer rates may not necessarily have good IOPS.

Although a detailed review of different RAID levels might be interesting, consideration here will only briefly be given to RAID levels 1 and 5, which are used on Sun SITE NE.

RAID level 1 is simple mirroring and is designed to optimise data availability rather than speed, since each write is duplicated to each mirror disk involved, resulting in a small write performance penalty since the system has to wait for all the drives to complete. Reading is actually improved since the least busy drive can be selected for the operation.
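The behaviour described above can be sketched as a toy model. This is not how any particular product implements mirroring: the class name, the in-memory "disks" and the queue-depth measure of how busy a drive is are all assumptions made for illustration.

```python
class Mirror:
    """Toy RAID 1 volume: every write goes to all disks, reads go to one."""

    def __init__(self, copies: int, blocks: int):
        self.disks = [[b""] * blocks for _ in range(copies)]
        # Outstanding requests per disk, a stand-in for how busy it is.
        self.queue_depth = [0] * copies

    def write(self, block: int, data: bytes) -> int:
        # The write completes only when every copy has been updated,
        # so its cost is set by the slowest drive.
        for disk in self.disks:
            disk[block] = data
        return len(self.disks)  # number of physical writes issued

    def read(self, block: int) -> bytes:
        # Any copy will do, so pick the least busy drive.
        idx = min(range(len(self.disks)), key=self.queue_depth.__getitem__)
        return self.disks[idx][block]


mirror = Mirror(copies=2, blocks=8)
physical_writes = mirror.write(3, b"trans log record")
mirror.queue_depth = [5, 0]  # pretend disk 0 is busy
data = mirror.read(3)        # served from disk 1
```

One logical write costs as many physical writes as there are copies, whereas a read touches a single drive.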

RAID level 5 comprises striping (RAID level 0) with interleaved parity. Striping simply breaks up the data stream into equal sized chunks or stripe blocks and writes them sequentially across multiple drives in a 'stripe'. To avoid data loss resulting from the failure of a single drive, parity information and data are interleaved across the disk array in a cyclic pattern. Should a single disk fail, a simple algorithm allows the absent data to be restored from the remaining disks.
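The "simple algorithm" mentioned above is normally a byte-wise exclusive OR (XOR) across the blocks of a stripe. The sketch below shows the idea under stated assumptions: the particular parity rotation, the dictionary-based "disks" and all function names are hypothetical and chosen for clarity, not taken from any specific RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def parity_disk(stripe: int, ndisks: int) -> int:
    """Disk holding parity for a stripe; rotates cyclically (assumed layout)."""
    return (ndisks - 1 - stripe) % ndisks

def write_stripe(disks, stripe, chunks):
    """Place ndisks-1 data chunks plus their parity into one stripe."""
    ndisks = len(disks)
    p = parity_disk(stripe, ndisks)
    data = iter(chunks)
    for d in range(ndisks):
        disks[d][stripe] = xor_blocks(chunks) if d == p else next(data)

def rebuild(disks, stripe, failed):
    """Recover the failed disk's block by XOR-ing the surviving blocks."""
    survivors = [disks[d][stripe] for d in range(len(disks)) if d != failed]
    return xor_blocks(survivors)


# Four disks, each modelled as a dict mapping stripe number to block.
disks = [dict() for _ in range(4)]
write_stripe(disks, 0, [b"AAAA", b"BBBB", b"CCCC"])  # parity on disk 3
write_stripe(disks, 1, [b"DDDD", b"EEEE", b"FFFF"])  # parity on disk 2
recovered = rebuild(disks, 0, failed=1)              # lost block of disk 1
```

Because XOR is its own inverse, the same routine recovers either a data block or the parity block, whichever happened to live on the failed drive.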

RAID 5 is generally considered to be a good compromise, providing reasonable random access performance without the expense of mirroring, but the down side is an expensive overhead for small writes. A single block write at RAID 5 requires four I/O operations plus two parity calculations in a read-modify-write sequence.
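The small-write penalty can be illustrated with a sketch of the read-modify-write sequence. This is one common way of arriving at the four I/Os and two parity calculations quoted above, assuming XOR parity. The function names and the dictionary-based "disks" are hypothetical.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(disks, data_disk, parity_disk, stripe, new_data):
    """Update one data block in a stripe; returns (I/O count, XOR count)."""
    old_data = disks[data_disk][stripe]        # I/O 1: read old data
    old_parity = disks[parity_disk][stripe]    # I/O 2: read old parity
    delta = xor(old_data, new_data)            # XOR 1: what has changed
    new_parity = xor(old_parity, delta)        # XOR 2: fold change into parity
    disks[data_disk][stripe] = new_data        # I/O 3: write new data
    disks[parity_disk][stripe] = new_parity    # I/O 4: write new parity
    return 4, 2


# Three disks: two data blocks and their parity (0x01 XOR 0x02 = 0x03).
disks = [{0: b"\x01"}, {0: b"\x02"}, {0: b"\x03"}]
ios, xors = small_write(disks, data_disk=0, parity_disk=2,
                        stripe=0, new_data=b"\x05")
```

Note that the other data disks in the stripe are never touched, which is why the cost stays at four I/Os however wide the array is.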

On Sun SITE NE mirroring is only used for some very special data - the trans or logging file system logs, with the vast majority of the archive data itself now safely resident in RAID 5 file systems.

4.3.3 Solstice DiskSuite 4.0

Sun's Solstice DiskSuite, sometimes also known as Online: DiskSuite (ODS), provides a convenient set of tools (command line and graphical) to support functions like:

UFS logging has several important benefits that make it highly attractive for work with large archive file systems, including:

Briefly, logging works as follows. When a file system update is required to a trans file system, the request is logged to the separate trans area (a different disk partition) which is itself typically 2 or 3-way mirrored for resilience. Mirroring is vital since one does not wish to lose any updates to the trans area and hence the consistency of the main file system itself.
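The general idea of logging an update before applying it can be sketched as follows. This is a generic write-ahead log model, not a description of DiskSuite's internals: the class, its method names and the in-memory dictionaries are assumptions made for illustration.

```python
class TransFS:
    """Toy logging file system: updates hit a mirrored log before the master."""

    def __init__(self, log_mirrors: int = 2):
        self.master = {}                               # main file system state
        self.logs = [[] for _ in range(log_mirrors)]   # mirrored log copies

    def update(self, key, value):
        # Record the intended change in every copy of the log first.
        for log in self.logs:
            log.append((key, value))

    def _replay(self, log):
        for key, value in log:
            self.master[key] = value

    def roll(self):
        # Later, apply the logged updates to the master and empty the log.
        self._replay(self.logs[0])
        for log in self.logs:
            log.clear()

    def recover(self, failed_mirror=None):
        # After a crash, replay any surviving log copy rather than
        # checking the whole file system.
        survivors = [log for i, log in enumerate(self.logs)
                     if i != failed_mirror]
        self._replay(survivors[0])


fs = TransFS(log_mirrors=2)
fs.update("inode 42", "size=1024")
fs.recover(failed_mirror=0)   # one log copy lost, the other still usable
```

The sketch also shows why the log is mirrored: if the only copy of the log were lost between `update` and `roll`, the pending changes would be lost with it.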

Solstice DiskSuite 4.0 works fairly well, but it does appear to have some limitations, including:

4.3.4 Quantum Drives

All attempts to connect Quantum's XP34301 4.3 GByte Grand Prix differential SCSI disk drives to the SPARCstorage Array failed. The storage array would just 'hang' and/or fail to see any drives at all, even Sun drives on other SCSI buses. This was both disappointing and time-consuming. The drives were found to 'kind of work' on another Sun SCSI controller, but never very satisfactorily, suffering SCSI bus resets and other problems. It would appear that Solaris SCSI drivers only really like Sun-supplied drives, which one understands have received special performance enhancing firmware changes.

Having gained experience of Solstice DiskSuite and its software RAID, but still faced with the problem of successfully interfacing the Quantum drives, our attention now turned to a kind offer to try out a hardware RAID controller.

4.3.5 Hardware RAID

Trimm Technologies kindly donated several of their MRS 19in. rack mount enclosures to house the Quantum Grand Prix disk drives. When they learnt of the operational difficulties encountered with interfacing the Quantum drives to Solaris, Trimm Technologies offered the loan/evaluation of a brand new high performance SCSI-to-SCSI RAID controller that they distribute. The CMD Technology CRD-5500 RAID controller has, since going into service, worked flawlessly with both the Quantum drives and Solaris. Although it is still early days, it looks like a very promising solution. Features include (manufacturer's figures):

[Figure: CRD-5500 RAID controller]

[Figure: Example use of a CRD-5500 RAID controller]

4.3.6 Power Fail

One of the few spectacular hardware failures to strike the archive took place early on Easter Monday, 1996. Quite simply, the power failed when the main cabinet circuit breaker correctly tripped out. The cause appears to have been a faulty switch-mode power supply inside a Sun disk tray. The individual trays do not appear to have their own fuses or breakers, and the complete loss of service as the result of a single hardware failure is worth watching out for when implementing fault-tolerant systems. Apart from that one incident, the Sun hardware has been very reliable.

4.4 Sun SITE NE - Software Services

A large amount of effort has gone into writing specialised tools to help maintain the archive (for example, mirror), and into supporting user-level applications for searching this and other archives.

Just some of the services available include:

FTP and FTP mail
FTP provides the normal FTP access to the archive (FSP is also available), using the very popular wu-ftpd software, whereas FTP mail allows users lacking FTP access to submit FTP requests by email and have the resulting output returned also by email - a very popular service.
Archie
The Archie service allows users to search a large database containing the location of file names at sites all around the world.
News
Sun SITE NE also acts as a large Usenet news hub especially for the UK academic community, but doesn't provide end users with news access.
Sources account
Users can telnet to the archive and log in as sources, which allows them to browse the archive to help find files or prepare them for later download by whatever means they choose.
NFS and SMB access
Sites can NFS mount the archive (read-only of course), and PC users can also gain access via SMB (LanWare).
WWW and Web cache(s)
Naturally, the archive can be accessed via the Web, but it also provides a large Web cache (harvest/squid) inter-working with other caches to help save network bandwidth and hopefully quickly provide users with the pages for which they are looking.
Access to Various Web Search Engines
A Web page is set up to make this as easy as possible for users.

4.5 Sun SITE NE - Content

Currently, Sun SITE NE mirrors some 500 packages from sites all around the world, a number limited only by the available disk space, which until recently stood at around 70 GB. For an up-to-date list of files, carefully consider downloading the large listing file (currently 16 MB after compression) ls-lR.Z, or, better still, note that the FTP command here supports scanning for files by pattern, ls -sf:pattern - for example, ls -sf:emacs-18.58.tar.Z.

4.6 Sun SITE NE - Connectivity

One of the problems arising from the success of the archive has been the large amount of network traffic generated by this site. To help ease this, we are now inviting IP providers interested in providing better connectivity to Sun SITE NE for their customers to step forward and discuss how we might proceed - for example, with direct connections. Hopefully this will help ease the burden on already over-stretched academic networks and improve our global connectivity. However, it has to be noted that a lot of Sun SITE NE traffic does currently go to other UK academic sites, although improvements in academic network bandwidth connectivity are also being discussed.

4.7 Sun SITE NE - The Next Generation

Sun SITE NE has been phenomenally successful, so much so that there are times when it is simply swamped with users and response suffers. To help solve this problem, a significant upgrade has been proposed to Sun and as of mid-August 1996, it is reported to have been agreed to and the final configuration just has to be confirmed. All being well, the new system is expected to comprise the present system plus a very powerful top-of-the-range Sun Ultra Enterprise server.

It is hoped that the combined power of these two servers will provide a sound platform for the future and the capacity to support interesting new services.

5 Conclusion

What started as a small, convenient file archive for a few keen programmers has grown into one of the world's largest and most popular archives, providing not just a vast collection of public access files collected from all around the world, but a web cache, a search engine, and a whole lot more. With hopefully any growing pains now over, Sun SITE Northern Europe looks like having a bright and promising future.

Bibliography

[1] http://sunsite.doc.ic.ac.uk/
[2] http://www.doc.ic.ac.uk/f?/lmjm
[3] http://www.ukuug.org/
[4] http://www.sun.com/
[5] http://www.quantum.com/quantum.html
[6] http://www.park.org/
[7] http://www.sun.com/products-n-solutions/hw/peripherals/array.200.html
[8] http://www.quantum.com/products/manuals/gp-scsi-manual/chap2.html#2.1
[9] http://www.trimm.com/
[10] http://www.cmd.com/
[11] http://www.sun.com/products-n-solutions/hw/servers/ultra_enterprise/6000/index.html


Organised by: AUUG'96 & CSU