Sun SITE Northern Europe - The Archive Continues...
Stuart McRobert
Department of Computing,
Imperial College, London, England
Email: sm@doc.ic.ac.uk
Abstract
This paper looks behind the scenes at many of the disk and file system issues
involved in supporting and attempting to expand the size of a large and
active archive site - Sun SITE Northern Europe. The present hardware configuration
and support for various file systems are discussed, including experience
of disk concatenation, level 5 RAID support both in software and hardware,
and the need for UFS logging (trans) coupled with disk mirroring.
Some software services are also mentioned along with brief ideas for improving archive
network connectivity. Finally, significant expansion of the archive to cope with ever-increasing demands on resources is eagerly anticipated.
1 Introduction
Sun SITE Northern Europe
is a hugely successful archive site providing access to a vast collection of
information and a wide range of user services. This paper concentrates
mainly on the storage technology underlying it, in terms of both the software and hardware
used to support the archive's extensive file systems, and also shares many of the
experiences gained over the past few years.
2 Brief History
A long long time ago when the VAX reigned supreme and Internet connectivity
was just a distant dream, a few of us started to save
freely distributable software collected from distant universities in a common
area. The idea was fairly simple: before embarking on the often lengthy
and arduous task of attempting a complex download
from some distant site
(which might take a week or more to complete),
one would first look to see if someone else had already obtained the software,
or was already attempting to download it.
This saved time, network bandwidth and needless wasteful duplicated effort.
It also widened people's knowledge of what was available - browsing the archive
started very early on.
Hence the idea for the archive was born -
almost out of necessity, due to poor network connectivity (well, at least
some things never change) - and the number of local users soon grew considerably.
However, it soon became clear that there were far more users than people willing to
fetch files back. One person,
Lee McLoughlin, in his 'spare' time, took
on the task of maintaining the content of the archive, something that he does to this day.
Meanwhile, I concentrated on the hardware side; hence the bias towards it
in this paper.
Early help in getting the archive going, in the form of new disk space and controllers,
came from the UK Unix User Group
(UKUUG), along with many donations from the academic research community,
who provided numerous hand-me-down SMD disk drives (chiefly
the trusty Fujitsu Eagle) as research groups moved over to new computer systems using SCSI disk drives instead.
Many other UK universities heard about what we were doing and the popularity
of the archive climbed with access over the UK academic X.25 network (JANET)
of those days.
In the summer of 1993, Sun offered us the opportunity of being one of their first large Sun SITEs,
replacing our seriously over-burdened Sun 4/360 with a powerful new Sun SS1000,
and, most importantly, all-new fast SCSI disk drives.
Sun Microsystems has
kindly continued to upgrade the system ever since, so helping make
Sun SITE Northern Europe one of the most successful archives in the world.
3 People
Special thanks should go to my colleague Lee McLoughlin for all the work
he has done in his spare time (evenings/weekends) and the extensive range
of software he has developed and released to help maintain the archive. Most notable is
probably mirror, which automatically updates local copies of remote files
from distant FTP sites and so helps to keep the archive right up-to-date
(networks and available disk space permitting, of course).
A few other specialists also help to maintain specific areas of the archive;
thank you.
4 Sun SITE Northern Europe
4.1 Present Day Configuration
In mid-August 1996, Sun SITE Northern Europe comprised the following
hardware configuration:
-
A Sun Enterprise Sparc Center 1000 with,
-
Eight 50 MHz SPARC CPUs, each with 1 MB external caches,
-
A GByte of main memory (RAM) and 8 MB of battery backed NVRAM (for presto server use),
-
Around 170 GB of available online disk space (i.e. space one can actually use),
-
Two FDDI interfaces, one dual and one single attach,
-
Several Ethernet and SCSI interfaces,
-
A Trimm Technologies supplied CMD CRD-5500 SCSI RAID controller,
-
A Sun SPARCstorage Array model 200,
-
A Sun SPARC Storage Library (Exabyte based) tape stacker unit, and
-
A few other useful peripherals (CDs, tape drives, etc.).
This provides the sound hardware platform on which the archive now runs,
but although it may sound like a lot, it gets to be very busy indeed, especially during term time when all the students are accessing it too.
4.2 Disk Storage
Since disk space is the key to the underlying operation of any archive,
it is worth reviewing how the deployment of disks has
changed over the past few years. Disks supplied by Sun are usually
housed in conventional Sun style 19in. disk trays,
each of which can mount up to six
5.25in. disk drives with either one or two SCSI buses per tray.
Each SCSI bus is then connected to a Sun differential SCSI SBUS
controller, with differential SCSI being chosen simply because of the cable lengths involved.
The number of SCSI devices (for example, disk drives) per controller is always tricky
to decide. Clearly, for maximum performance a one-to-one mapping would be
ideal, although not very practical since the number of host expansion slots available
to accept controllers tends to be limited -
for example, an SS1000 has three slots per motherboard,
and up to four motherboards per system
(and some slots will be required for other uses such as
networking).
On the other hand, two to three drives per
controller (half a Sun disk tray) is a reasonable compromise for general use
and normally provides quite acceptable performance.
However, as more disks arrive, even this density of drives per controller is hard to accommodate.
Finally, with the introduction of RAID controllers supporting six or more
of their own fast SCSI buses, six drives per RAID SCSI bus becomes a fairly popular drive density with
perhaps a seventh drive as a warm/hot spare on the bus too. This density of drives
actually works out quite well for RAID 5 use, since the I/O is fairly uniformly
spread across all the disks and the transfers can be highly interleaved to make
excellent use of bus bandwidth.
The RAID controllers themselves are then
connected back to the host via a dedicated fast wide differential SCSI
controller or maybe even via fibre channel to maximize available bandwidth.
4.3 Archive Storage Expansion
Initially, to meet the ever-increasing demand for more disk space on the archive,
additional disk drives were quickly added and simply 'grown' onto the end of
the existing file system
via the Solaris meta disk growfs command. However, a number of issues started to arise:
-
When a Unix file system is initially made with the newfs command,
one can specify the ratio of the number of inodes (that is, files) to
file system data blocks (via the -i option).
Getting this well balanced at the start is very important since significant disk space
can be wasted either way, and it cannot be adjusted once the file system has been built.
Sun's recent default values for newfs have tended to create far too many
inodes for general use, wasting valuable disk space by pre-assigning inodes
for files that may never be created.
Fortunately, to Sun's credit, the defaults err on the right side, since
running out of inodes is much harder to work around than running out of free space; with practice, though, one can do much better.
In general, when calculating the parameters for a new file system, simple linear
extrapolation of the present usage figures can be used (assuming no significant
change in file system usage is planned or expected), with an extra 20% or so of
inodes assigned as an adequate safety margin (well, at least for our archive work - you might want to assign more).
It is generally much easier to compress (or better compress)
files when short of disk space than to cope with running out of inodes.
However, over time, file system usage patterns change, and although one can easily grow a simple concatenated file system to provide more disk space, the ratio of inodes to file space remains fixed.
An option to growfs to adjust this for the next disk being added would be a really
useful management tool, although implementation may be a bit tricky in
certain cases, and easier in others.
Meanwhile,
the only solution today is to save all the files and remake the file system,
a somewhat time-consuming process rarely appreciated by users or administrators
alike.
-
I/O performance of a large concatenated file system is often less than ideal.
For example, fsck, the file system checker, would take ages to work its way across all our concatenated drives. Each disk drive's access light would plot fsck's slow progress across the file system (especially obvious during the first few passes), drive by drive, one by one.
When changed to a RAID 5 configuration with roughly the same size file system,
the time to complete a fsck was significantly reduced since the I/O was spread
uniformly across all the drives; that is, they were all busy with parallel rather than sequential
access. The time to fsck the file system was cut to roughly a third.
-
Each additional disk drive added to a concatenated file system reduces the
MTBF of the overall file system. So the longer the delay in introducing a
RAID solution, the more likely a catastrophic failure would be experienced and the bigger the conversion task.
-
But modern disk drives are highly reliable.
Does one really need to waste valuable disk space to, say,
RAID parity information just to improve file system availability?
Do you really feel that lucky?
Either way, as a result of careful risk assessment
(pure luck?),
none of the 15 or so disk drives comprising the
concatenated file system ever failed - but it did come close.
Two similar drives did fail, helping to convince people to support the move to
a RAID solution, which would also boost file system
I/O performance - an all round winner, reliable and faster.
-
A Level 5 RAID solution (striping with interleaved parity) was selected as the
most appropriate approach.
-
The introduction of a RAID 5 file system would also require both more disk space to
accommodate the extra parity data and preferably enough new disk space to allow
the new RAID to be built before destroying the old file system.
Such an approach had several key advantages:
-
Full user service could be maintained throughout the transition period from old to new,
albeit read only during the actual transfer period.
-
Disk to disk file system copies are much quicker and easier to perform than using backup tapes.
-
The 'feel good' factor - if anything goes wrong, one can always quickly and easily
fall back to the original file system. Of course, whenever one
takes such precautions they are rarely needed,
but when one doesn't, disaster always seems to strike.
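The inode-density estimate described earlier in this section (linear extrapolation of present usage plus a safety margin of about 20%) can be sketched as follows. This is only an illustration in Python, not a tool used on the archive: the function name and the example usage figures are hypothetical. The result is the bytes-per-inode value one would hand to newfs via the -i option.

```python
def bytes_per_inode(used_bytes, used_inodes, margin=0.20):
    """Estimate a bytes-per-inode value for newfs -i.

    used_bytes and used_inodes describe the present file system
    (for example, as reported by df). margin is the extra fraction
    of inodes to allow for; 0.20 matches the 20% suggested in the text.
    """
    if used_bytes <= 0 or used_inodes <= 0:
        raise ValueError("usage figures must be positive")
    # Average file size observed on the existing file system.
    average_file_size = used_bytes / used_inodes
    # Asking for more inodes in the same space means fewer bytes per inode.
    return int(average_file_size / (1.0 + margin))


# Hypothetical figures: 60 GB in use, spread over 1.5 million files.
nbpi = bytes_per_inode(60 * 10**9, 1_500_000)
print("newfs -i", nbpi)
```

Since the ratio cannot be changed later, it seems safer to round the result down (more inodes) than up.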
Sun supplied additional disk space to initially support the migration
to a level 5 software RAID subsystem,
but it would not be alone.
With Sun's support, some 200 GB of disk space donated by
Quantum was also made
available, out of a Terabyte of disk space kindly donated
to support the Central Parks located at each
Global Internet Exchange (GIX) point
(of which the archive is one) for
The Internet 1996 World Exposition.
Now, in order to interface all these drives to Sun SITE, significantly more SCSI
buses would be required.
4.3.1 SPARCstorage Array
Sun kindly provided us with a
SPARCstorage Array model 200 to help with our
ongoing requirement to support ever more disk space. The Sun SPARCstorage Array
provides high performance and reliability, flexible configuration with
Sun supplied drives, and ease of use. Some of its key features are:
-
A 40 MHz microSPARC controller with 2 MB of DRAM for transient storage.
-
4 MB of battery backed NVRAM with ECC as a read/write cache,
allowing for immediate acknowledgments of host writes.
-
Full duplex 25 MB/s Fibre Channel interface to the host, which can be up
to 2 km away (but ours is only about 1 m).
-
Six intelligent Fast/wide (20 MB/s) SCSI-2 buses to the disks.
-
Over 15 MB/s sustained transfer rate.
-
2,000 I/O operations per second (with 2 KByte transfers).
-
Firmware stored in 512 KB of Flash EPROM to allow updates to be easily downloaded from
the host.
-
Simultaneous support for RAID 0, 1, 0+1, 5 and independent spindles.
-
Internally, the SPARCstorage Array has a couple of buses: the TSBus (used for
moving data between the microSPARC CPU and program memory) and the PSBus
(which moves data between the disk drives and the Fibre
Channel interface, and has 80 MB/s of bandwidth),
both of which are based on the SBus standard.
Several specialised chipsets are also used for SCSI, Fibre Channel Interface
and interconnection of the buses - for more detail please consult
The SPARCstorage Array Architecture, a Technical White Paper available from Sun.
4.3.2 Disk Arrays
This section attempts to quickly introduce some of the key information, issues
and benefits of using disk arrays,
along with details of the two levels of RAID that
are currently used on the archive.
Disk arrays first became really popular with supercomputers because of the
need for very high performance I/O along with vast storage requirements,
often easily exceeding the capacity and performance of single specialist
disk drives.
However,
by spreading the data across multiple high performance drives via
specialised controllers, dramatic performance improvements could be achieved
with transfer rates approaching the aggregate of the combined drives.
In 1987, researchers at Berkeley formally proposed the idea of using
Redundant Arrays of Inexpensive Disks, or RAIDs, for much more general use.
They wanted to achieve the performance and reliability of larger, much more expensive
drives whilst using a number of much cheaper smaller units. The use of multiple drives
also required a means of protecting the data from the loss of any single drive
in the array,
and helped raise MTBF values. As a consequence, RAIDs are considered fault
tolerant and many users buy them for that reason alone, although there are much wider benefits, including briefly:
-
Much higher data transfer rates (just how high depends on the type of RAID selected),
-
Larger number of I/O transactions per second,
-
Increased data and system availability, since single drive failures can be tolerated,
-
Easier data management,
-
Reduced maintenance and down time.
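The MTBF argument can be made concrete with a rough calculation. Assuming independent drive failures at a constant rate (a simplification), a set of n drives with no redundancy, which is lost as soon as any one drive fails, has an MTBF of roughly the single-drive MTBF divided by n. The 500,000 hour drive figure below is purely illustrative and is not taken from any drive mentioned in this paper.

```python
def unprotected_mtbf_hours(drive_mtbf_hours, drives):
    """Approximate MTBF of `drives` disks with no redundancy.

    Assumes independent failures at a constant rate, so the failure
    rates simply add up and the combined MTBF shrinks by 1/drives.
    """
    if drives < 1:
        raise ValueError("need at least one drive")
    return drive_mtbf_hours / drives


HOURS_PER_YEAR = 24 * 365

for n in (1, 5, 15):
    years = unprotected_mtbf_hours(500_000, n) / HOURS_PER_YEAR
    print(f"{n:2d} drives: about {years:.1f} years between failures")
```

This is why redundancy becomes more important, not less, as the number of drives grows.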
Although there are several different types of RAID, all systems achieve higher
reliability and/or performance by duplicating or spreading data between
multiple disk drives. Just how this is done can seriously affect performance,
both in terms of read/write transfer rates (MB/s to or from the RAID)
and I/O operations per second (IOPS) which measures how good the RAID is at
handling multiple independent I/O requests per second. It is worth noting that
a RAID with high transfer rates may not necessarily have good IOPS.
Although a detailed review of different RAID levels might be interesting,
consideration here will only briefly be given to RAID levels 1 and 5,
which are used on Sun SITE NE.
RAID level 1 is simple mirroring and is designed to optimise data availability
rather than speed, since each write is duplicated to each mirror disk involved,
resulting in a small write performance penalty since the system has to wait for
all the drives to complete. Reading is actually improved since the least busy
drive can be selected for the operation.
RAID level 5 comprises striping (RAID level 0) with interleaved parity.
Striping simply breaks up the data stream into equal sized
chunks or stripe
blocks and writes them sequentially across multiple drives in a 'stripe'.
To avoid data loss resulting from the failure of a single drive, parity
information and data are interleaved across the disk array in a cyclic pattern.
Should a single disk fail, a simple algorithm allows the absent data to be
restored from the remaining disks.
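The "simple algorithm" is exclusive-or: the parity block of a stripe is the XOR of its data blocks, so any one missing block is the XOR of all the survivors. The sketch below illustrates this in Python. The rotation rule used to choose the parity disk is just one possible cyclic pattern, picked for illustration; it is not claimed to be the layout used by DiskSuite or by any particular controller, and all function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, disks):
    """Disk holding the parity block for a stripe (one possible rotation)."""
    return (disks - 1 - stripe) % disks


def build_stripe(data_blocks, stripe, disks):
    """Lay out disks-1 data blocks plus one parity block across the disks."""
    if len(data_blocks) != disks - 1:
        raise ValueError("a stripe holds disks-1 data blocks")
    layout = list(data_blocks)
    layout.insert(parity_disk(stripe, disks), xor_blocks(data_blocks))
    return layout


def recover(layout, failed):
    """Rebuild the block of the failed disk from the surviving disks."""
    survivors = [block for i, block in enumerate(layout) if i != failed]
    return xor_blocks(survivors)


stripe0 = build_stripe([b"AAAA", b"BBBB", b"CCCC"], stripe=0, disks=4)
assert recover(stripe0, failed=1) == b"BBBB"
```

The same recover step works whether the lost block held data or parity, which is why a single drive failure can be tolerated.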
RAID 5 is generally considered to be a good compromise, providing reasonable random
access performance without the expense of mirroring, but the down side is an
expensive overhead for small writes. A single block write at RAID 5 requires
four I/O operations plus two parity calculations in a read-modify-write
sequence.
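The read-modify-write sequence can be sketched as follows: read the old data and old parity, XOR the old data out of the parity and the new data in, then write both blocks back. This is an illustrative model in Python in which "disks" are simply entries in a list; the function names are hypothetical.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_disk, parity_disk, new_data):
    """Update one data block in a stripe and return the number of disk I/Os."""
    ios = 0
    old_data = disks[data_disk]            # I/O 1: read old data
    ios += 1
    old_parity = disks[parity_disk]        # I/O 2: read old parity
    ios += 1
    change = xor(old_data, new_data)       # parity calculation 1
    new_parity = xor(old_parity, change)   # parity calculation 2
    disks[data_disk] = new_data            # I/O 3: write new data
    ios += 1
    disks[parity_disk] = new_parity        # I/O 4: write new parity
    ios += 1
    return ios


# Two data blocks plus their parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
assert small_write(stripe, 0, 2, b"\xff\x00") == 4
assert stripe[2] == xor(stripe[0], stripe[1])
```

Note that the other data disks in the stripe are never touched, which is what keeps the cost at four I/Os regardless of the width of the array.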
On Sun SITE NE mirroring is only used for some very special data -
the trans or logging file system logs,
with the vast majority of the archive data itself now safely resident in
RAID 5 file systems.
4.3.3 Solstice DiskSuite 4.0
Sun's Solstice DiskSuite, sometimes also known as Online: DiskSuite (ODS),
provides a convenient set of tools (command line and graphical) to support
functions like:
-
Up to three-way mirroring of any file system,
-
Disk striping,
-
UFS logging (trans),
-
Online concatenation of physical drives,
-
Online expansion of file systems,
-
Disksets across one or two hosts,
-
Hot spare management, and
-
Creation and management of level 5 RAID devices (in software).
UFS logging has several important benefits that make it highly attractive for
work with large archive file systems, including:
-
Faster local directory operations.
-
Much faster reboots since any outstanding actions stored in the log can simply
be performed once the system comes back up. The file system is then known
to be in a sound stable state, so there is no need for a lengthy fsck to be performed, which can take hours for 100 GB sized file systems.
-
Decreases in synchronous disk writes by safely committing them to the log
before they are applied to the file system.
Briefly, logging works as follows.
When a file system update is required to a trans file system,
the request is logged to the separate trans area (a different disk partition)
which is itself typically 2 or 3-way mirrored for resilience.
Mirroring is vital since one does not wish to lose any updates to the
trans area and hence the consistency of the main file system itself.
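The general idea of logging updates before applying them can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification: the class and method names are hypothetical, and it says nothing about how UFS logging or DiskSuite are actually implemented. It only shows why replaying a log after a crash can replace a full file system check.

```python
class LoggedStore:
    """Toy model of a store protected by a write-ahead log."""

    def __init__(self):
        self.log = []    # stands in for the (mirrored) trans log area
        self.data = {}   # stands in for the main file system

    def update(self, key, value):
        # An update counts as committed once it is safely in the log.
        self.log.append((key, value))

    def roll_log(self):
        # Apply the logged updates in order, then empty the log.
        for key, value in self.log:
            self.data[key] = value
        self.log.clear()

    def recover(self):
        # After a crash, replaying the log brings the store back to a
        # consistent state without scanning everything it contains.
        self.roll_log()


store = LoggedStore()
store.update("inode 7", "link count 2")
store.recover()   # pretend the system went down before the log was rolled
assert store.data == {"inode 7": "link count 2"}
```

The work done at recovery is proportional to the size of the log, not to the size of the file system, which matters a great deal at 100 GB.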
Solstice DiskSuite 4.0 works fairly well, but it does appear to have some limitations, including:
-
ODS 4 does not expect the physical location of drives within a system to ever
need to change - for example, from controller to controller, or from controller to a SPARCstorage array.
So if one wishes to improve system I/O performance by adding some additional
SCSI controllers or by just moving drives around to rebalance the system,
it is not at all easy.
ODS encodes such data within its internal tables (almost out of
reach of the system administrator) chiefly based around Solaris major and
minor device numbers. Commands are provided to make such changes,
but using them requires great skill and knowledge of how ODS works and the way it
interfaces with the system.
However, with the assistance of the ODS authors, a move of drives from individual disk
controllers to the SPARCstorage array was achieved, but took several hours of
painstaking work with telephone support, and wasn't in the end entirely successful.
Perhaps as the result of an earlier power loss,
or perhaps because of the failure to flush out a trans area properly,
or perhaps as a result of some other oversight,
somehow one file system became badly corrupted.
So in conclusion, don't try this at home with valuable corporate data without
adequate backups and lots of time to recover.
-
Should a disk drive fail within a RAID 5 array, it can take rather a long time for
ODS to resync a large file system and make the RAID resilient again.
Although the rebuild process will start automatically if a
spare drive is available, ODS uses the host computer CPU power to recompute all
the data. In theory, I'm told this should take around 35 minutes
per GByte on a lightly loaded SS1000, but Sun SITE NE is never lightly loaded.
Furthermore,
because of a bug (hopefully now fixed)
should the host system go down during a RAID rebuild,
ODS fails to automatically restart and continue the rebuild once service has
been restored.
Sadly, when the Sun UK 'hot line' was asked why a rebuild was taking so long (5 days),
they failed to recognise this known problem, although they did wonder why no
rebuild progress figures were being reported by metastat.
They just suggested leaving the resync to 'run' for a few more days over a weekend.
So if metastat just
reports resyncing without any percentage done figure, for example:
d12: RAID
    State: Resyncing
    Hot spare pool: hsp000
    Interlace: 32 blocks
    Size: 190241960 blocks
Original device:
    Size: 190243488 blocks
        Device      Start Block  Dbase  State      Hot Spare
        c2t0d0s6    330          No     Okay
        c2t1d0s6    330          No     Resyncing  c2t6d0s6
        c2t2d0s6    330          No     Okay
        ...
seek professional help in restarting the RAID rebuild.
Regrettably, the 100 GB RAID above was in fact lost when a SCSI bus reset occurred
and the host system believed it had lost another SCSI drive.
There lies the problem: whilst a resync is underway, your RAID is vulnerable to
a second failure, and the longer it takes to complete, the bigger the risk.
Since it will take much longer to resync a really large file system - and they
tend to be well worth protecting - RAID 5 may not be the ideal way to store the data.
Alternatively, a faster way of resyncing the data, perhaps using hardware, might
be better.
-
As documented, Prestoserve may not be used with metadevices or
metadevices that are part of a trans device, which is unfortunate if you already
have the NVRAM installed.
-
Several firmware updates have been released for the SPARCstorage Array along
with other Solaris patches; it is generally worth keeping up-to-date.
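The 35 minutes per GByte figure quoted for a software RAID 5 resync gives a quick way to judge whether a rebuild is taking unreasonably long. The Python sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, and the rate is the theoretical figure for a lightly loaded host, so real rebuilds on a busy system will be slower.

```python
def resync_estimate_hours(size_gb, minutes_per_gb=35.0):
    """Rough resync time for a software RAID 5 of the given size."""
    if size_gb < 0 or minutes_per_gb <= 0:
        raise ValueError("size must be non-negative and rate positive")
    return size_gb * minutes_per_gb / 60.0


hours = resync_estimate_hours(100)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days")
# prints "58.3 hours, or about 2.4 days"
```

By this estimate, even the theoretical rebuild of a 100 GB RAID leaves it exposed to a second failure for more than two days.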
4.3.4 Quantum Drives
All attempts to connect
Quantum's XP34301 4.3 GByte Grand Prix differential SCSI disk drives to the
SPARCstorage array failed. The storage array would just 'hang'
and/or fail to see any drives at all, even Sun drives on other SCSI buses.
This was both disappointing and time-consuming.
The drives were found to 'kind of work' on another Sun SCSI
controller, but never very satisfactorily, suffering SCSI bus resets and other problems.
It would appear that Solaris SCSI drivers only really like Sun-supplied drives,
which one understands have received special performance enhancing
firmware changes.
Having gained experience of Solstice DiskSuite and its software RAID,
but still faced with the problem of successfully interfacing the Quantum drives,
our attention now turned to a kind offer to try out a hardware RAID controller.
4.3.5 Hardware RAID
Trimm Technologies kindly donated several
of their MRS 19in. rack mount enclosures to house the
Quantum Grand Prix disk drives.
When they learnt of the operational difficulties encountered with interfacing the Quantum drives to Solaris,
Trimm Technologies offered the loan/evaluation of a brand new
high performance SCSI-to-SCSI RAID controller that they distribute. The
CMD Technology CRD-5500 RAID controller has, since going into service,
worked flawlessly with both the Quantum drives and Solaris.
Although it is still early days, it looks like a very promising solution.
Features include (manufacturer's figures):
-
CRD-5500 SCSI to SCSI RAID controller
-
Over 6,450 IOPS and 17 MB/s SCSI Host Channel
-
32-bit LR33310 RISC (MIPS R3000 core) CPU
-
Support for RAID levels 0, 1, 0+1, 4, and 5
-
Convenient Modular, Scalable Design with up to 8 Fast Wide Differential SCSI
Disk Channels and up to 4 Host Channels, and a dual redundant controller option too.
-
Four Proprietary CMD RAID ASICs provide for advanced features and the highest
performance, including:
-
XOR ASIC used in Exclusive-Or parity calculations at RAID 4 and 5,
-
DMA ASIC to control the data path hardware for the various I/O ports,
-
Memory Controller ASIC to control the memory system and support data movement
on the internal bus at high speed (80 MB/s burst, 60 MB/s sustainable),
-
CPU Interface ASIC supports the MIPS R3000 RISC CPU.
-
Battery backup for industry standard memory acting as a data cache (up to
512 MB)
[Figure: CRD-5500 RAID controller]
[Figure: Example use of a CRD-5500 RAID controller]
4.3.6 Power Fail
One of the few spectacular hardware failures to strike the archive took place
early on Easter Monday, 1996.
Quite simply, the power failed when the main cabinet circuit breaker correctly
tripped out. The cause appears to have been a faulty switch mode power
supply inside a Sun disk tray.
The individual trays
do not appear to have their own fuses/breakers, and the complete loss of
service as the result of a single hardware failure is worth watching out for
when implementing fault tolerant systems.
Apart from that one incident, the Sun hardware has been very reliable.
4.4 Sun SITE NE - Software Services
A large amount of effort has gone into writing specialised tools to help
maintain the archive (for example, mirror), along with support for user level applications to search this and/or other archives.
Just some of the services available include:
-
FTP and FTP mail
-
FTP provides the normal FTP access to the archive (FSP is also available),
using the very popular wu-ftpd software, whereas FTP mail allows users
lacking FTP access to submit FTP requests by email and have the resulting output
returned also by email - a very popular service.
-
Archie
-
The Archie service allows users to search a large database containing the location of file names at sites all around the world.
-
News
-
Sun SITE NE also acts as a large Usenet news hub especially for the UK academic
community, but doesn't provide end users with news access.
-
Sources account
-
Users can telnet to the archive and login as sources which allows them to browse
the archive to help find files or prepare them for later download by whatever
means they choose.
-
NFS and SMB access
-
Sites can NFS mount the archive (read-only of course), and PC users can also gain
access via SMB (LanWare).
-
WWW and Web cache(s)
-
Naturally, the archive can be accessed via the Web, but it also provides a large
Web cache (harvest/squid) inter-working with other caches to help save network
bandwidth and hopefully quickly provide users with the pages for which they are looking.
-
Access to Various Web Search Engines
-
A Web page is set up to make this as easy as possible for users.
4.5 Sun SITE NE - Content
Currently, Sun SITE NE mirrors some 500 packages from
sites all around the world, a number limited only by
available disk space, which until recently stood at around 70 GB. For an up-to-date list of files,
think carefully before downloading the large listing file (currently 16 MB
after compression)
ls-lR.Z,
or, better still, note that the FTP command here supports scanning for files
by pattern, ls -sf:pattern - for example, ls -sf:emacs-18.58.tar.Z.
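For anyone who has already fetched and uncompressed the listing file, a similar search can be run locally. The Python sketch below is only an illustration: it assumes the listing follows the usual recursive ls -l layout (a "directory:" header line followed by nine-column entries, with the file name last), which may not match the archive's actual file exactly, and the function name is hypothetical.

```python
import fnmatch


def find_in_listing(lines, pattern):
    """Return paths from ls -lR style output whose file name matches pattern."""
    directory = "."
    matches = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.endswith(":"):
            # A header such as "./gnu:" introduces a new directory.
            directory = line[:-1]
            continue
        # Assumed columns: mode, links, owner, group, size, month, day,
        # time or year, name. Blank and "total N" lines are skipped.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        name = fields[8]
        if fnmatch.fnmatchcase(name, pattern):
            matches.append(directory + "/" + name)
    return matches


sample = [
    "./gnu:",
    "total 100",
    "-rw-r--r-- 1 ftp ftp 1234 Aug 1 1996 emacs-18.58.tar.Z",
    "-rw-r--r-- 1 ftp ftp 99 Aug 1 1996 README",
]
print(find_in_listing(sample, "emacs-*"))
```

Searching on the server with ls -sf:pattern remains the cheaper option, since it avoids transferring the 16 MB listing at all.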
4.6 Sun SITE NE - Connectivity
One of the problems arising from the success of the archive has been the large
amount of network traffic generated by this site. To help ease this, we are now
inviting IP providers interested in providing better connectivity to
Sun SITE NE for their customers to step forward and discuss how we might proceed - for example, with direct connections.
Hopefully this will help ease the burden on already over-stretched
academic networks and improve our global connectivity. However, it has to be noted that a lot of Sun SITE NE traffic does currently go to
other UK academic sites, although improvements in academic network bandwidth
connectivity are also being discussed.
4.7 Sun SITE NE - The Next Generation
Sun SITE NE has been phenomenally successful, so much so that there are times
when it is simply swamped with users and response suffers.
To help solve this problem, a significant upgrade has been proposed to Sun
and, as of mid-August 1996, it is reported to have been agreed,
with only the final configuration still to be confirmed.
All being well, the new system is expected to comprise
the present system plus a very powerful top-of-the-range
Sun Ultra Enterprise server.
It is hoped that the combined power of these two servers will provide
a sound platform for the future and the capacity to support interesting
new services.
5 Conclusion
What started as a small, convenient file archive for a few keen programmers has grown
into one of the world's largest and most popular archives, providing not just a
vast collection of public access files collected from all around the world, but a web cache,
a search engine, and a whole lot more. With any growing pains hopefully now over, Sun SITE Northern Europe looks set for a bright and promising future.
Bibliography
[1] http://sunsite.doc.ic.ac.uk/
[2] http://www.doc.ic.ac.uk/f?/lmjm
[3] http://www.ukuug.org/
[4] http://www.sun.com/
[5] http://www.quantum.com/quantum.html
[6] http://www.park.org/
[7] http://www.sun.com/products-n-solutions/hw/peripherals/array.200.html
[8] http://www.quantum.com/products/manuals/gp-scsi-manual/chap2.html#2.1
[9] http://www.trimm.com/
[10] http://www.cmd.com/
[11] http://www.sun.com/products-n-solutions/hw/servers/ultra_enterprise/6000/index.html