This paper will explore some of the ad hoc methods generally used for Data Mining in the scientific community, including such things as scientific visualisation, and outline how some of the more recently developed products used in the commercial environment can be adapted to scientific Data Mining.
However, this is not a new phenomenon. Scientists, especially experimentalists, have long had to tackle this problem. While Isaac Newton may have formulated his theory of gravity when an apple fell on his head, it was still followed by hundreds, if not thousands, of experiments demonstrating, validating and/or refining the original equation. Taking more recent examples, the volume of data generated by space probes and particle physics dwarfs anything previously contemplated. Looking closer to home, scientists at ANSTO often analyse data generated over their entire working careers of twenty or thirty years.
Over the centuries, various methods have been developed to deal with this volume of data, many of which were seen as major steps forward for mathematics at the time. Some of these methods include Fast Fourier Transforms, Multivariate Regression Analysis, as well as a whole range of statistical methods. More recently, Visualisation has been widely adopted by scientists as a means of studying the ever-growing masses of data.
The process of extracting previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions (Simoudis 1995).
This definition has a definite business flavour and much of IBM's development of Data Mining has been in this direction. In practice, Data Mining is a process which can take on different approaches depending on the type of data involved and the objectives desired. As this is still very much an evolving discipline, much work is being undertaken to determine standard processes for the varied environments. Further, as the context in which the data is gathered is often an important component, this must be factored into any analysis.
Data Mining consists of three components: the captured data, which must be integrated into organisation-wide views, often in a Data Warehouse; the mining of this warehouse; and the organisation and presentation of this mined information to enable understanding.
Data capture is a fairly standard process of gathering, organising and cleaning up the data: for example, removing duplicates, deriving missing values where possible, establishing derived attributes and validating the data. Many of the subsequent processes assume the validity and integrity of the data within the warehouse. These steps are no different to those in any data gathering and management exercise.
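The clean-up steps named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names, the plausible temperature range and the choice of a mean to fill in missing values are all hypothetical and do not come from any particular warehouse.

```python
from statistics import mean

# Hypothetical raw readings: (sample_id, temperature_c, pressure_kpa)
raw = [
    ("s1", 20.0, 101.3),
    ("s1", 20.0, 101.3),   # exact duplicate
    ("s2", None, 100.9),   # missing temperature
    ("s3", 22.0, 101.1),
    ("s4", 500.0, 101.0),  # implausible value, fails validation
]

def clean(records, temp_range=(-50.0, 60.0)):
    """Deduplicate, validate, fill missing values, add a derived attribute."""
    # 1. Remove exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(records))
    # 2. Validate: drop records whose temperature is present but out of range.
    lo, hi = temp_range
    valid = [r for r in unique if r[1] is None or lo <= r[1] <= hi]
    # 3. Derive missing temperatures from the mean of the known values
    #    (an assumed strategy; raises an error if no values are known).
    fill = mean(r[1] for r in valid if r[1] is not None)
    filled = [(sid, fill if t is None else t, p) for sid, t, p in valid]
    # 4. Establish a derived attribute: temperature in kelvin.
    return [
        {"id": sid, "temp_c": t, "pressure_kpa": p, "temp_k": t + 273.15}
        for sid, t, p in filled
    ]

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Whether a missing value should be derived from a mean, interpolated, or left absent depends on the context in which the data was gathered, which is why the fill strategy is kept as a single, easily replaced step.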
The Data Mining process itself is the extraction of valid and previously unknown information, as given in the definition above. There are two approaches: verification driven, whose aim is to validate a hypothesis postulated by a user, or discovery driven, which is the automatic discovery of information by the use of appropriate tools.
The Data Mining process is not a simple linear function, as it often involves a variety of feedback loops. While applying a particular technique, the user may determine that the selected data is of poor quality, or that the applied techniques did not produce results of the expected quality. In such cases, the user has to repeat and refine earlier steps, possibly even restarting the entire process from the beginning. This is best illustrated in the following figure from Simoudis' paper.
The final step is the presentation of the information in a format suitable for interpretation. This may be anything from a simple tabulation to video production and presentation. The techniques applied depend on the type of data and the audience as, for example, what may be suitable for a knowledgeable group would not be reasonable for a naive audience.
There are a number of standard techniques used in verification driven mining; these include the most basic form of query and reporting, presenting the output in graphical, tabular and textual forms, through to multi-dimensional analysis and on to statistical analysis.
For instance, such analysis of shopping has identified two groups of people who purchase baking items, the first being older, retired couples, and the second, young couples with large families. The next step may be to look at product linkage; for example, there may be a group of people who purchase men's suits, women's high fashion shoes, men's ties and expensive chocolates. They do not buy baby clothes, housewares and greeting cards. This indicates that a store may be able to bring in more customers for a sale of suits if they have chocolates for half-price, or better yet, give away the chocolates.
These procedures can also be used to analyse any activity that generates large volumes of data, from specific surveys through to the collection of operational data, such as stock movements or point-of-sale information. An example of this is Market Basket Analysis, which refers to the discovery of patterns among items purchased together, such as the correlation between the purchase of paint and the purchase of paint brushes or paint thinner. These associations can then be used to determine shelf locations and to plan promotional sales.
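As a minimal sketch of how such associations might be measured, the following Python counts how often pairs of items occur together and reports three commonly used measures: support, confidence and lift. The baskets are invented for illustration, and the function name `pair_rules` is hypothetical rather than part of any commercial product.

```python
from collections import Counter
from itertools import combinations

# Invented point-of-sale baskets, for illustration only.
baskets = [
    {"paint", "paint brushes", "paint thinner"},
    {"paint", "paint brushes"},
    {"paint", "cat food"},
    {"cat food", "greeting cards"},
    {"paint brushes", "paint thinner"},
]

def pair_rules(baskets):
    """Return {(a, b): (support, confidence, lift)} for each item pair."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n                       # share of baskets with both
        confidence = count / item_counts[a]       # estimate of P(b | a)
        lift = confidence / (item_counts[b] / n)  # above 1: more often than chance
        rules[(a, b)] = (support, confidence, lift)
    return rules

rules = pair_rules(baskets)
for pair, (support, confidence, lift) in sorted(rules.items()):
    print(pair, round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift close to or below 1 suggests that two items appear together no more often than chance would predict, which is one simple screen for associations that may not be meaningful. It is no substitute for judgement by someone who knows the domain.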
Such analysis is the main force driving the introduction of Data Mining within large organisations and, thus, the current interest in such research. It invariably involves the interrogation of large volumes of data, using high performance systems and massive amounts of storage. However, there is still a need to apply some common sense to the results, as spurious patterns and associations may be found. It is quite possible for an association to be found between the purchase of paint and cat food, which may be caused by other factors that were not part of the original analysis.
Most commonly, Data Mining is a single step in the entire process of Decision Support, and fits into the general process: Data Warehouse - Data Mining - Decision Support.
What is not considered in much of the work on Data Mining is that most, if not all, of this work is just as applicable to the scientific environment. One of the critical issues with Data Mining is the need for a credibility check performed by someone who knows the field. Most scientists, and in particular experimentalists, have a great respect for their data, being well aware of the dangers of using inapplicable methods for analysis. An excellent example of this is given in Clifford Stoll's new book Silicon Snake Oil, in which he describes a study by an astronomer, Professor Li Fang, into the periodic motions of the earth's axis. This study involved the analysis of thousands of years of astronomical measurements. Dr Li had performed all the measurements by hand and Clifford was attempting to show him how easy it would have been with a computer. When Clifford presented his results, Dr Li replied:
When I compare the computer's results to my own, I see that an error has crept in. I suspect it is from the computer's assumption that our data is perfectly sampled throughout history. Such is not the case, especially during the Sung dynasty. And so, it may be that we need to analyze the data in a slightly different manner. Having a computer, I had naturally cast the problem as simple data analysis. ... The real challenge was understanding the data and finding a good way to use it (Stoll 1995).
The underlying principle of the Scientific Method, the cycle of observation-hypothesis-experiment, fits well with the processes of Data Mining: discovery driven mining suits the observation-hypothesis step, and verification driven mining suits the hypothesis-experiment step. As scientists have been working with this principle for centuries, and as most mathematics has been intended to support such scientific endeavours, many, if not all, of the methods are already being used by them. In many cases, the only change is in the terminology, not in the practice.
The final stage in any Data Mining is the presentation of results, which has a very long history in scientific work and is also an area of rapid change. It ranges from the simple graphs that scientists have long studied through to the latest visualisation techniques being demonstrated on high performance graphics workstations.
One analysis looked at over 64 separate items measured over a number of years to extract the items that were significant. Initially the analysis was discovery driven mining, attempting to find which parameters were significant, either by themselves or in conjunction with others. Using statistical methods such as multivariate regression analysis, the significant parameters and their relative influence were determined. From this, an equation was developed, which was then verified through verification driven mining against new datasets.
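The two phases described above can be sketched as follows: a discovery step that fits a linear equation to several parameters by ordinary least squares, and a verification step that checks the fitted equation against a new dataset. This is a deliberately small illustration, not the analysis reported here. The data is synthetic, built so that the response depends on the first parameter only, and the names `fit_linear` and `predict` are hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes the parameters are not collinear (non-singular system).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

# Discovery: synthetic data generated from y = 1 + 2*x1 (x2 is irrelevant).
X_train = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 6)]
y_train = [3, 5, 7, 9, 11]
coef = fit_linear(X_train, y_train)

# Verification: compare the fitted equation with a new dataset.
X_new = [(6, 2), (7, 9)]
y_new = [13, 15]
max_error = max(abs(predict(coef, x) - y) for x, y in zip(X_new, y_new))
print(coef, max_error)
```

A coefficient near zero, as for the second parameter here, is a hint that the parameter has little influence. In real work its significance would be judged with proper statistical tests rather than by inspection.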
Of more general interest, global climate change studies, a hot research area, are primarily a verification driven mining exercise. Climate data has been collected for many centuries and is being extended into the more distant past through such activities as the analysis of ice core samples from the Antarctic. At the same time, a number of different predictive models have been proposed for future climatic conditions. These models are verified by using them to predict past conditions and comparing the predictions with the sample data. From this, the models are further refined and used for another round of verification driven mining.
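A verification round of this kind can be reduced to a simple comparison: run each candidate model over the period for which observations exist and score how closely it reproduces them. The sketch below uses root-mean-square error as the score. The observations and both models are made up purely for illustration and bear no relation to any real climate record or published model.

```python
from math import sqrt

# Made-up past observations: (year, measured value). Illustrative only.
observed = [(1900, 0.0), (1950, 0.2), (2000, 0.6)]

# Two hypothetical candidate models.
def model_linear(year):
    return 0.006 * (year - 1900)

def model_flat(year):
    return 0.1

def rmse(model, observations):
    """Root-mean-square error of a model's predictions of past observations."""
    errors = [(model(year) - value) ** 2 for year, value in observations]
    return sqrt(sum(errors) / len(errors))

scores = {m.__name__: rmse(m, observed) for m in (model_linear, model_flat)}
best = min(scores, key=scores.get)
print(scores, best)
```

The model with the lowest error would be carried forward for refinement and a further round of verification. A low score shows only agreement with past data, not that the model is correct.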
As a final point, the biggest of all, the Internet, is becoming more and more important, and while it holds useful information, extracting it from the terabytes being added daily is an enormous task. The techniques of Data Mining are applicable here more than in any other domain. However, making use of them takes time, effort and, above all, people with a knowledge of the field who can differentiate the true solutions from the infeasible.