Performance Management for Large Commercial Environments
John Palmer, Vice President of Technology, Open Enterprise Systems
Amdahl Corporation
Abstract
This presentation discusses the needs in measuring, monitoring, and reporting performance information in a large, heterogeneous, commercial enterprise. We will discuss UMA, the emerging standard for collecting, managing, and distributing data.
First we will discuss Amdahl’s systems management model. Then we will discuss the issues surrounding performance management in a heterogeneous environment. Next, we will discuss the architecture of the new collection, management, and distribution capabilities for performance data.
We will describe in detail the Universal Measurement Architecture (UMA), followed by the products A+UMA, A+OpenWatch and A+OpenTune. Next, we will identify the benefits of using performance management products like A+UMA, A+OpenWatch and A+OpenTune on open systems. Finally, we will summarise, and have an open discussion on where we go from here.
Commercial, production applications are now being brought up on Open Systems platforms. These platforms, due to the nature of the demands being placed on them, are selected to provide high capacity and high availability, so there is a critical need to provide performance management. This is necessary so that service levels can be promised and met, and planning for future needs can take place.
This presentation will discuss a new architecture as well as the products which Amdahl is currently offering.
Products which are needed in the Systems Management arena can be grouped into six functions: Financial Control, Planning, Security, Operational Control & Administration, Problem Management and Change Management. The general product needs fall into these functions, which can be executed or applied to any IT asset or object. This presentation will concentrate on performance management and capacity planning, which are consistently among the highest priorities.
Amdahl’s systems management product offerings are sold under the umbrella of A+.
The theme of this presentation is that, in order to manage Open Systems environments, particularly heterogeneous distributed environments, you need to gather, manage, and distribute performance data. Several problems currently stand in the way.
First, there is no standard performance collection methodology. This means that data collection is different for each platform and there is no standard way to gather and analyse performance data across dissimilar platforms.
Next, the current methodologies for collecting data are not reliable. Users have no assurance that performance data is being collected and managed, and therefore no assurance that it will be available when needed. If a collector goes down, it needs to be manually started again.
The current collection capabilities require that users collect their own data. Obviously, this greatly increases the data collection overhead. In particular, if a service level problem exists at a particular time, what usually happens is that a number of analysts get on the machine and try to determine what is happening. In so doing, each collects his own data and each additional data collection adds to the overhead and thereby worsens the original problem.
The current data collection techniques provide data records that are not synchronised. For example, CPU data is collected over one time interval and I/O data is collected over another time interval. This makes it very difficult to correlate the data in the two records.
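To illustrate why synchronised intervals matter, here is a minimal Python sketch. The `Sample` record and the `intervals` and `correlate` helpers are hypothetical names invented for this illustration and are not part of any product or standard discussed here. The sketch pairs CPU and I/O samples only when they cover exactly the same interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    start: int    # interval start, in seconds
    end: int      # interval end, in seconds
    value: float  # measured value for that interval


def intervals(period, total):
    """Return (start, end) pairs of length `period` covering 0..total."""
    return [(t, t + period) for t in range(0, total, period)]


def correlate(cpu, io):
    """Pair CPU and I/O samples that cover exactly the same interval."""
    io_by_interval = {(s.start, s.end): s for s in io}
    return [
        (c, io_by_interval[(c.start, c.end)])
        for c in cpu
        if (c.start, c.end) in io_by_interval
    ]


# Unsynchronised: CPU sampled every 10 s, I/O every 15 s (illustrative values).
cpu_10s = [Sample(s, e, 50.0) for s, e in intervals(10, 30)]
io_15s = [Sample(s, e, 200.0) for s, e in intervals(15, 30)]
unsynchronised_pairs = correlate(cpu_10s, io_15s)

# Synchronised: both sampled over the same 10 s intervals.
io_10s = [Sample(s, e, 200.0) for s, e in intervals(10, 30)]
synchronised_pairs = correlate(cpu_10s, io_10s)
```

With the differing intervals no CPU record lines up with an I/O record, so nothing can be paired directly. With a common interval every record has a partner.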
As discussed earlier, it is very difficult to make sense of performance data collected in a heterogeneous, distributed environment. Applications are now being distributed across different platforms but there is no good way to tie all of the performance data together.
We also need better data - from the kernel, from applications, and from major printer complexes. For example, the costs of printer environments are now approaching the costs of large servers but there currently are no good ways to manage this large investment.
A user will start a transaction such as requesting data. Some editing of that request will take place on the workstation and then the request will be transmitted over a network through a transaction management platform to a platform which actually contains the data. The data will then be shipped back to the requester.
A performance problem may occur at each step along the way. Measurement must take place on all platforms (and the network) and then the data needs to be brought to one place and then married together so that the transaction can be tracked and the performance problem identified.
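As a rough sketch of what "marrying" the data could look like, assume each platform reports the time it spent on a transaction together with a shared transaction identifier. The identifier, the record layout, and the `slowest_hop` helper are all assumptions made for this illustration. The timings are invented.

```python
from collections import defaultdict

# (transaction id, where it was measured, elapsed seconds) -- invented data.
records = [
    ("txn-1", "workstation", 0.02),
    ("txn-1", "network", 0.15),
    ("txn-1", "transaction-manager", 0.05),
    ("txn-1", "data-server", 1.30),
    ("txn-2", "workstation", 0.03),
    ("txn-2", "network", 0.90),
    ("txn-2", "transaction-manager", 0.04),
    ("txn-2", "data-server", 0.20),
]


def slowest_hop(records):
    """Group measurements by transaction and return the slowest step of each."""
    by_txn = defaultdict(list)
    for txn, hop, seconds in records:
        by_txn[txn].append((hop, seconds))
    return {txn: max(hops, key=lambda h: h[1]) for txn, hops in by_txn.items()}


bottlenecks = slowest_hop(records)
```

Only once the records from every platform and the network sit in one place can the slowest step of each transaction be picked out this way.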
Originally, the Performance Management Working Group (PMWG) was formed as part of UNIX International in order to design a standard methodology for collecting, managing, and distributing performance data in heterogeneous, distributed open environments.
With the demise of UI, PMWG is now being sponsored by the Computer Measurement Group (CMG). PMWG submitted the UMA specifications to X/Open in March 1994 for approval as a standard. In January of 1995, the Sysman committee of X/Open unanimously accepted the specifications as preliminary documents.
The list of participating organisations in PMWG includes Amdahl, IBM, Sequent, Hewlett Packard, Sun, AT&T GIS and Boeing. As you can see, the major open systems and performance management vendors and several large end-users are represented.
In addition, as you would expect, most of these same companies belong to X/Open and provide funding for that organisation.
With this kind of support, it’s easy to see that UMA will be widely supported in the near future. In fact, we are now beginning to see UMA specified in procurement documents from large IT users.
UMA is designed to collect, manage, and distribute data to Measurement Application Programs (MAPs) that request the data through standard interfaces. This makes it easy for vendors who specialise in performance management applications to write applications for the open systems environments.
UMA defines a small set of records that are required, allowing MAP vendors to depend on that set. It also allows new data collectors for other data sources, such as RDBMSs, and allows platform-specific data to be added, as Amdahl has for UTS on the S/390 Compatible platform.
The collection of data by UMA has to be done using minimal CPU resources. In fact, Amdahl’s development goal is to use no more than 5% of the CPU for collecting UMA data.
UMA has the goal to provide access to both real time and historical data without changing applications. This allows the administrator to fix critical problems and do long term planning without having to learn multiple applications.
Of course, UMA is intended to support heterogeneous systems to meet the needs of users who already own systems from multiple vendors and want a consistent performance management architecture. While UMA is supported by UNIX systems vendors, it allows for other systems to be included in the managed environment.
The UMA Architecture calls for four layers, and two interfaces. Starting from the bottom, the first layer is the data capture layer. It is responsible for collecting raw data. These modules are also referred to as “providers”.
The providers communicate with the Measurement Control Layer and the Data Services Layer through the Data Capture Interface (DCI), one of the UMA defined APIs.
The Measurement Control Layer schedules and synchronises data collection.
The Data Services Layer manages the UMA data. It accepts requests from the Measurement Application Programs (MAPs) and provides the data. The Data Services Layer communicates with the MAPs through the Measurement Layer Interface (MLI).
The Measurement Layer Interface, another UMA defined API, isolates the MAPs from the implementation details of the other UMA components. This isolation provides for the extension of collectors and the addition of collectors without having to rewrite the MAPs. It is a key element allowing products from multiple vendors to work together.
At the top are MAPs (Measurement Application Programs) that request data through the Measurement Layer Interface. Through this interface, the application uses standard APIs to define how the information is to be filtered and to request data from the DCI. Only the data requested by the MAP is sent across the network to the application.
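The layering can be pictured with a small Python sketch. This is only a toy model under stated assumptions: the class and method names (`Provider`, `MeasurementControl`, `DataServices`, `collect`, `query`) are invented for illustration and are not the DCI or MLI calls defined by the UMA specifications. The metric values are made up.

```python
class Provider:
    """Data capture layer: supplies raw metrics for one subsystem."""

    def __init__(self, name, read):
        self.name = name
        self.read = read  # callable returning a dict of metric -> value


class DataServices:
    """Data services layer: stores records and answers filtered requests."""

    def __init__(self):
        self.records = []

    def store(self, record):
        self.records.append(record)

    def query(self, wanted):
        # Return only the fields the application asked for.
        return [{k: r[k] for k in wanted if k in r} for r in self.records]


class MeasurementControl:
    """Measurement control layer: samples every provider for one interval."""

    def __init__(self, providers, services):
        self.providers = providers
        self.services = services

    def collect(self, interval_start):
        # One record per interval keeps all metrics synchronised.
        record = {"interval_start": interval_start}
        for provider in self.providers:
            for metric, value in provider.read().items():
                record[f"{provider.name}.{metric}"] = value
        self.services.store(record)


services = DataServices()
control = MeasurementControl(
    [
        Provider("cpu", lambda: {"busy_pct": 42.0}),
        Provider("disk", lambda: {"reads": 120, "writes": 30}),
    ],
    services,
)
control.collect(interval_start=0)
control.collect(interval_start=10)

# The application (playing the role of a MAP) asks only for CPU data.
result = services.query(["interval_start", "cpu.busy_pct"])
```

The point of the sketch is the separation: a new provider can be added to the list without changing the code that issues the query, and the query returns only the fields that were requested.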
What UMA provides is the capability to collect performance data from heterogeneous platforms onto a performance server and then access that data from any location in the enterprise in order to analyse the data.
For real time tuning, the user can access the data directly off the platform where the data was collected. The data can then be moved to a performance server during off hours for future analysis for capacity planning purposes.
Now I’d like to describe the emerging UMA standard and how it can help address distributed systems management problems. UMA provides the following for distributed systems:
- Wide vendor endorsement to address the user with heterogeneous systems.
- Open, published interfaces so that products from multiple vendors work together.
- A minimum defined set of data points, so that vendors can expect the same data.
- Extensibility for both additional collectors and vendor-specific data.
Amdahl is announcing three products in the performance management arena. The first product is the A+UMA Performance Data Manager which collects data from the kernel, manages this data and then makes it available to any user who wants to access the data. A+UMA is Amdahl’s implementation of PMWG’s specification of the UMA architecture.
The A+OpenWatch Distributed Threshold Monitor is a product which monitors many different platforms and looks for thresholds which are being exceeded. When a threshold is exceeded, the user can either look into the problem in detail by drilling down using the A+OpenTune product or send an SNMP trap to a network manager. A+OpenTune can also be invoked from the network manager.
The A+OpenTune Performance Monitor allows the user to look in detail at both current and historical performance data in order to determine bottlenecks in the system and thereby tune it.
Other future products/services which will be able to use A+UMA data will include capacity planning and chargeback applications.
The performance management industry has learned a lot by using performance management products over the last 25 years in an MVS environment. Members of the PMWG have applied this experience in designing the architecture of UMA.
A+UMA has been implemented based on the specifications developed by the PMWG. The MLI and the Data Services Layer have been implemented in A+UMA. The DCI will be implemented as soon as the specification stabilises.
A+UMA has not only been implemented based on the UMA standard, but additionally incorporates functions to increase reliability of data collection. A watchdog daemon watches for problems occurring in the collection process and restarts the collection process if a problem occurs.
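The watchdog idea can be sketched in a few lines of Python. This is not how A+UMA is implemented; the `Collector` class and `watchdog_check` function are hypothetical stand-ins, and a real daemon would run the check on a timer against an actual collection process.

```python
class Collector:
    """Stand-in for a data collection process."""

    def __init__(self):
        self.running = False
        self.starts = 0

    def start(self):
        self.running = True
        self.starts += 1

    def crash(self):
        # Simulate the collector going down unexpectedly.
        self.running = False


def watchdog_check(collector):
    """Restart the collector if it is down; return True if a restart happened."""
    if not collector.running:
        collector.start()
        return True
    return False


collector = Collector()
collector.start()
healthy_check = watchdog_check(collector)   # collector is up, nothing to do
collector.crash()
recovery_check = watchdog_check(collector)  # collector is down, restart it
```

The contrast with the manual restart described earlier is that no operator has to notice the failure before collection resumes.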
A+OpenTune is a Measurement Application Program or MAP that uses A+UMA data for input, and provides a graphical view of the A+UMA instrumented server environment. Examples of the data displayed by A+OpenTune are listed. You can see that it includes the key data for performance management.
A+OpenTune is based upon Motif, providing a powerful and easy to use graphical user interface.
With Hypertext-sensitive help, the user can click on a highlighted word in the help text to receive additional information.
A+OpenTune contains a mechanism to provide both audible and visual alarms when an exception condition is encountered. The user can set the thresholds for exceptions through an easy pull-down menu, and can change those settings with the same menu.
A+OpenTune can display current data, historical data from the A+UMA data files, or data from private files.
All these A+OpenTune features combine to provide the system administrator with the data required to respond to critical situations, tune systems, and analyse historical data for future planning. Because it’s based on UMA, it will integrate with other UMA-based products.
Let’s look at a typical configuration. A+UMA is installed and runs on each server being monitored. A+OpenTune runs on a Solaris workstation and provides displays of the performance variables from the server being monitored. The A+UMA data is communicated to A+OpenTune via TCP/IP.
The user may run more than one window of A+OpenTune, looking at more than one server. Or he may shut down and start up one A+OpenTune at a time looking at a different server with each startup of the MAP.
A+OpenWatch provides the capability for a single administrator to monitor many distributed systems. It provides administrator-defined thresholds for individual systems or groups of systems, and then only notifies the administrator when a threshold is exceeded. Thresholds can be defined for any UMA value. Visual and audible alarms can be initiated when thresholds are exceeded.
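Threshold monitoring of this kind reduces to a simple comparison. The Python sketch below is an illustration only: the metric names, the threshold values, the server names, and the `exceeded` helper are assumptions and do not reflect A+OpenWatch configuration syntax.

```python
def exceeded(system, metrics, thresholds):
    """Return (system, metric, value) for each metric above its threshold."""
    return [
        (system, name, value)
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


# One set of thresholds applied to a group of systems (invented values).
group_thresholds = {"cpu.busy_pct": 90.0, "disk.busy_pct": 80.0}

latest = {
    "server-a": {"cpu.busy_pct": 95.0, "disk.busy_pct": 40.0},
    "server-b": {"cpu.busy_pct": 30.0, "disk.busy_pct": 20.0},
}

alerts = [
    alert
    for system, metrics in latest.items()
    for alert in exceeded(system, metrics, group_thresholds)
]
```

Only `server-a` produces an alert, so the administrator hears nothing about the healthy system.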
For problem identification and resolution, a copy of A+OpenTune can be invoked, automatically directed to analyse the violating system.
A+OpenWatch can send a trap to an SNMP manager for administrator notification, protecting your investment in administrator training and utilising installed products.
The same powerful Motif-based GUI is used in A+OpenWatch as in A+OpenTune, reducing training.
Of course, Hypertext-sensitive help is available as with A+OpenTune.
A+UMA collects performance data on each platform in your heterogeneous distributed environment. Through A+OpenWatch, you can set thresholds in order to be warned if performance problems occur on any platform.
A+OpenWatch then allows the user to go directly to A+OpenTune to determine the cause of the performance problem. A+OpenWatch can also send a trap to a network manager. That product can then execute a script to perform a specific function (such as paging a person) to handle the problem.
A+UMA is generally available on UTS Version 2.1.5 and later, UTS Version 4, and Solaris 2.3 and 2.4. It also runs on AIX, A+Edition, and SunOS. As at August 1995, versions for Hewlett Packard and Pyramid systems are in beta test status.
Collection of data on specific processes is now in the product.
Amdahl is continuing to plan support for A+UMA on additional platforms. Other system vendors are also planning to support UMA on their platforms.
Amdahl’s current plan is that A+OpenWatch and A+OpenTune will initially run in the Solaris environment, while A+UMA collectors run on a variety of platforms. Thus, from one Solaris workstation, the customer will be able to monitor an entire distributed, heterogeneous server environment.
SNMP is just as the name implies - simple. It was set up to look at a little data from a lot of sources. Its uses are for configuration management and for warning users of global metrics exceeding thresholds. If users want to look at more detailed data in order to find the cause of a problem, they will have to look at historical data from some source other than SNMP. We suggest UMA data to find and fix problems. UMA has been designed to collect and manage large amounts of performance data so that the user can determine if today’s problem is just an anomaly or if it is a growing problem. The important point is that historical data collected at a very high frequency is often needed to solve performance problems.
SNMP is also not efficient at collecting large amounts of data. An agent only collects data when it is requested to collect data at a particular time. In other words, there is network traffic for the request to collect data and then there is network traffic to pass the data itself. In UMA, the MAP sends a request once to collect data for a long period of time and then the data is collected on the server and only sent to the MAP when the MAP wants it.
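A back-of-the-envelope count shows the difference in network round trips. The model below is a deliberate simplification and an assumption: it counts one request and one response per polled sample, and one standing request plus one request/response pair each time the application fetches accumulated data. It counts messages only, not bytes, since the data itself has to cross the network in either case.

```python
def polling_messages(samples):
    """Polling model: a request and a response for every sample."""
    return 2 * samples


def standing_request_messages(fetches):
    """Standing-request model: one initial request, then a
    request/response pair each time the application fetches data."""
    return 1 + 2 * fetches


# One hour of data at a 10 second interval.
samples_per_hour = 3600 // 10

polled = polling_messages(samples_per_hour)
standing = standing_request_messages(fetches=1)
```

Under these assumptions an hour of 10 second samples costs 720 messages when polled, against 3 when collected on the server and fetched once.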
UMA can collect large amounts of data with very low overhead. We have measured data collection and management at 0.35% of a Solaris machine (with data collected on a 10-second interval and no process data being collected).
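To put that figure in perspective, the short calculation below converts it into CPU time. It assumes the measured percentage is a share of a single CPU, which the measurement above does not state explicitly, and compares it with the 5% development goal mentioned earlier.

```python
measured_pct = 0.35  # measured overhead, percent
goal_pct = 5.0       # development goal, percent
interval_s = 10      # collection interval, seconds

# CPU time spent on collection within one interval and over a full day,
# assuming the percentage is a share of one CPU.
cpu_seconds_per_interval = measured_pct / 100 * interval_s
cpu_seconds_per_day = measured_pct / 100 * 24 * 60 * 60

# How many times smaller the measurement is than the goal.
headroom = goal_pct / measured_pct
```

That works out to roughly 0.035 CPU seconds per 10 second interval, about five CPU minutes per day, and a measurement more than fourteen times below the goal.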
In addition to implementing the UMA architecture, Amdahl has added:
- A powerful disk display that allows the administrator to view many disks at once, looking for high usage and bottlenecks, or to view individual disk performance
- Additional RAS functions that allow A+UMA to recover from some errors that could cause UMA to stop collecting data
- A reporting tool that produces tabular forms of the performance data. With UBR (UMA Basic Reporter), the customer can set up a schedule to receive a written report (sent to a printer or to a file) on the activity of the system being monitored. This report can be scheduled on a regular basis (for example, each morning) or produced ad hoc. The data can also be fed into a spreadsheet.
- A standard Motif-based GUI, used to create powerful and easy to use graphical displays of the UMA data.
- Platform-specific extensions to the A+UMA collector for UTS, as allowed by the standard. We will continue to enhance our collectors and MAPs in the future to increase the product’s utility.
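The tabular reporting and spreadsheet export mentioned in the list above can be illustrated with a short Python sketch. It does not show UBR itself: the `tabular_report` function, the column names, and the values are hypothetical, and comma-separated text is simply one format that spreadsheets commonly read.

```python
import csv
import io


def tabular_report(records, columns):
    """Render records as comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    for record in records:
        # Missing metrics are written as empty cells.
        writer.writerow([record.get(column, "") for column in columns])
    return buffer.getvalue()


# Invented performance records, one per collection interval.
records = [
    {"interval_start": 0, "cpu.busy_pct": 42.0, "disk.reads": 120},
    {"interval_start": 10, "cpu.busy_pct": 57.5, "disk.reads": 98},
]

report = tabular_report(records, ["interval_start", "cpu.busy_pct", "disk.reads"])
```

The resulting text could be written to a file on a schedule or generated on request, matching the two modes of use described for the reporter.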
You know you have a critical requirement to manage your open systems environment. To meet that requirement, you need measurement data collected in a reliable fashion on each platform.
COPYRIGHT © 1995 by AUUG95 and APWWW95 Charles Sturt University. ALL RIGHTS RESERVED. ISBN 1 875781 43 9