Data Mining in a Scientific Environment

Jagoda Crawford & Frank Crawford
Information Management, ANSTO,
PMB 1, Menai NSW 2234 Australia
Email: jc@ansto.gov.au
           frank@ansto.gov.au

Abstract

Data Mining is a concept that is taking off in the commercial sector as a means of extracting useful information from gigabytes of data. While products for the commercial environment are starting to become available, tools for a scientific environment are much rarer (or even non-existent). Yet scientists have long had to search through reams of printouts and rooms full of tapes to find the gems that make up scientific discovery.

This paper will explore some of the ad hoc methods generally used for Data Mining in the scientific community, including such things as scientific visualisation, and outline how some of the more recently developed products used in the commercial environment can be adapted to scientific Data Mining.

1 Introduction

The advent of the computer has brought with it the ability to generate and store huge amounts of data. For example, it is not unusual for power users to have the equivalent of three or four encyclopedias' worth of data online. When you add the data generated by government and other organisations, such as the recently completed census or the data collected every time you make a purchase at any modern supermarket, the volume of data available is almost incomprehensible. The problem is how to turn this data into usable information.

However, this is not a new phenomenon. Scientists, especially experimentalists, have long had to tackle this problem. While Isaac Newton may have formulated his theory of gravity when an apple fell on his head, it was still followed by hundreds, if not thousands, of experiments demonstrating, validating and/or refining the original equation. Taking more recent examples, the volume of data generated by space probes and particle physics experiments dwarfs anything previously contemplated. Looking closer to home, scientists at ANSTO often analyse data generated over their entire working careers of twenty or thirty years.

Over the centuries, various methods have been developed to deal with this volume of data, many of which were seen as major steps forward for mathematics at the time. Some of these methods include Fast Fourier Transforms, Multivariate Regression Analysis, as well as a whole range of statistical methods. More recently, Visualisation has been widely adopted by scientists as a means of studying the ever-growing masses of data.

2 What Is Data Mining?

With the current trends in centralisation of an organisation's data in large databases, particularly in a commercial environment, the process of extracting useful information has become more formalised and the term Data Mining has been coined for it. In one of the first papers on commercial Data Mining, Evangelos Simoudis of IBM defined it as:

The process of extracting previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions (Simoudis 1995).

This definition has a definite business flavour and much of IBM's development of Data Mining has been in this direction. In practice, Data Mining is a process which can take on different approaches depending on the type of data involved and the objectives desired. As this is still very much an evolving discipline, much work is being undertaken to determine standard processes for the varied environments. Further, as the context in which the data is gathered is often an important component, this must be factored into any analysis.

Data Mining consists of three components: the captured data, which must be integrated into organisation-wide views, often in a Data Warehouse; the mining of this warehouse; and the organisation and presentation of this mined information to enable understanding.

The data capture is a fairly standard process of gathering, organising and cleaning up the data; for example, removing duplicates, deriving missing values where possible, establishing derived attributes and validating the data. Many of the subsequent processes assume the validity and integrity of the data within the warehouse. These steps are no different from those of any other data gathering and management exercise.
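The clean-up steps listed above can be sketched in a few lines of Python. This is only a minimal illustration and does not describe any particular product. The record layout (`site`, `date`, `rainfall_mm`, `yield_t`, `area_ha`), the use of (`site`, `date`) as the duplicate key and the rule of filling a missing rainfall value with the site mean are all assumptions made for the example.

```python
from statistics import mean

def clean(records):
    """Deduplicate, validate, fill missing values and add a derived attribute."""
    # 1. Remove duplicates: keep the first record for each (site, date) key.
    seen = set()
    unique = []
    for rec in records:
        key = (rec["site"], rec["date"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is left untouched

    # 2. Validate: reject physically impossible values.
    valid = [
        rec for rec in unique
        if rec["area_ha"] > 0
        and rec["yield_t"] >= 0
        and (rec["rainfall_mm"] is None or rec["rainfall_mm"] >= 0)
    ]

    # 3. Derive missing values where possible, using the mean of the known
    #    readings for the same site (an assumed, deliberately simple rule).
    known = {}
    for rec in valid:
        if rec["rainfall_mm"] is not None:
            known.setdefault(rec["site"], []).append(rec["rainfall_mm"])
    for rec in valid:
        if rec["rainfall_mm"] is None and rec["site"] in known:
            rec["rainfall_mm"] = mean(known[rec["site"]])

    # 4. Establish a derived attribute: yield per hectare.
    for rec in valid:
        rec["yield_per_ha"] = rec["yield_t"] / rec["area_ha"]
    return valid

# Hypothetical sample data, invented for this sketch.
raw = [
    {"site": "A", "date": "1995-01", "rainfall_mm": 40.0, "yield_t": 12.0, "area_ha": 4.0},
    {"site": "A", "date": "1995-01", "rainfall_mm": 40.0, "yield_t": 12.0, "area_ha": 4.0},
    {"site": "A", "date": "1995-02", "rainfall_mm": None, "yield_t": 10.0, "area_ha": 4.0},
    {"site": "A", "date": "1995-03", "rainfall_mm": 60.0, "yield_t": 14.0, "area_ha": 4.0},
    {"site": "B", "date": "1995-01", "rainfall_mm": 30.0, "yield_t": 5.0, "area_ha": 0.0},
]
cleaned = clean(raw)
```

Validation is done before the derived attribute is computed so that a zero area can never reach the division.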

The Data Mining process itself is the extraction of valid and previously unknown information, as given in the definition above. There are two approaches: verification driven, whose aim is to validate a hypothesis postulated by a user, and discovery driven, which is the automatic discovery of information by the use of appropriate tools.

The Data Mining process is not a simple sequence of steps, as it often involves a variety of feedback loops. While applying a particular technique, the user may determine that the selected data is of poor quality or that the applied techniques did not produce results of the expected quality. In such cases, the user has to repeat and refine earlier steps, possibly even restarting the entire process from the beginning. This is best illustrated in the following figure from Simoudis' paper.


Figure 1: The Data Mining Process

The final step is the presentation of the information in a format suitable for interpretation. This may be anything from a simple tabulation to video production and presentation. The techniques applied depend on the type of data and the audience as, for example, what may be suitable for a knowledgeable group would not be reasonable for a naive audience.

2.1 Verification Driven Data Mining

Currently, the most common use of Data Mining is verification driven, and is primarily aimed at confirmation of an idea. Generally, the mechanism is to propose some association or pattern and then to study the data to find support, or otherwise, for the proposal.

There are a number of standard techniques used in verification driven mining. These range from the most basic form, query and reporting with the output presented in graphical, tabular and textual forms, through multi-dimensional analysis to statistical analysis.

2.2 Discovery Driven Data Mining

The discovery driven approach depends on a much more sophisticated and structured search of the data for associations, patterns, rules or functions, which the analyst then reviews for value. The current techniques for performing discovery driven mining consist of four different approaches: predictive modelling, including neural nets; link analysis, which attempts to establish links between records; database segmentation, which partitions the data into collections of related records; and finally deviation detection, which identifies points that do not fit in a segment.
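The last two approaches, segmentation followed by deviation detection, can be illustrated with a short Python sketch. It rests on several assumptions: the segments are formed by grouping on an attribute that is already known (the hypothetical `instrument` field), whereas real discovery tools find the segments themselves; and a point is treated as a deviation when it lies more than three median absolute deviations from its segment's median, which is one common rule among many.

```python
from collections import defaultdict
from statistics import median

def segment(records, key):
    """Partition records into groups that share the same key value."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return dict(groups)

def deviations(values, threshold=3.0):
    """Return the values lying more than threshold * MAD from the median."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # At least half the values equal the median, so this rule has no
        # scale against which to judge the remainder.
        return []
    return [v for v in values if abs(v - centre) > threshold * mad]

# Hypothetical readings, invented for this sketch.
readings = [
    {"instrument": "A", "reading": r} for r in (10.1, 10.3, 9.9, 10.0, 17.5, 10.2)
] + [
    {"instrument": "B", "reading": r} for r in (50.2, 49.8, 50.0, 50.1, 49.9)
]

for name, recs in segment(readings, key=lambda rec: rec["instrument"]).items():
    print(name, deviations([rec["reading"] for rec in recs]))
```

The median is used rather than the mean because a single extreme point inflates the mean and standard deviation of a small segment enough to hide itself.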

3 Traditional Uses of Data Mining

Within the business world, Data Mining is being seen as a method of tapping into the value of the data within an organisation and providing a competitive advantage. An example of this is the analysis of purchase histories, drawn from credit card transactions, preferred customer schemes, frequent shopper schemes and any other purchasing data which includes customer information. Using a method called neural segmentation, a number of different types of purchase patterns can be identified and then customer groupings can be associated with this data.

For instance, such analysis of shopping has identified two groups of people who purchase baking items, the first being older, retired couples, and the second, young couples with large families. The next step may be to look at product linkage; for example, there may be a group of people who purchase men's suits, women's high fashion shoes, men's ties and expensive chocolates. They do not buy baby clothes, housewares and greeting cards. This indicates that a store may be able to bring in more customers for a sale of suits if they have chocolates for half-price, or better yet, give away the chocolates.

These procedures can be used further for the analysis of any activity that generates large volumes of data, from specific surveys through to the collection of operational data, such as stock movements or point-of-sale information. An example of this is Market Basket Analysis, which refers to the discovery of patterns within items purchased, such as the correlation between the purchase of paint and the purchase of paint brushes or paint thinner. These associations can then be used to determine shelf locations and to plan promotional sales.
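A minimal sketch of Market Basket Analysis for pairs of items is given below. It computes the usual support and confidence measures, together with lift, which compares the confidence of a rule with how often the right-hand item is bought anyway. The baskets are invented for the illustration, the function name `pair_rules` is hypothetical, and the minimum support of 0.3 is an arbitrary assumption; production tools use far more efficient algorithms over far larger item sets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3):
    """Return {(lhs, rhs): (support, confidence, lift)} for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeats within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]       # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # > 1 suggests association
            rules[(lhs, rhs)] = (support, confidence, lift)
    return rules

# Hypothetical baskets, invented for this sketch.
baskets = [
    ["paint", "paint brushes", "paint thinner"],
    ["paint", "paint brushes"],
    ["paint", "paint brushes", "cat food"],
    ["cat food", "greeting cards"],
    ["paint thinner", "paint"],
]

for (lhs, rhs), (support, confidence, lift) in sorted(pair_rules(baskets).items()):
    print(f"{lhs} -> {rhs}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift close to 1 indicates that the two items appear together about as often as chance alone would predict, which is one simple guard against reading too much into a pairing.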

Such analysis is the main force driving the introduction of Data Mining within large organisations and, thus, the current interest in such research. It is invariably related to the interrogation of large volumes of data, using high performance systems and massive amounts of storage. However, there is still the need to apply some common sense to the results, as spurious patterns and associations may be found. It is quite possible for an association to be found between the purchase of paint and cat food, which may be caused by other factors that were not part of the original analysis.

Most commonly, Data Mining is a single step in the entire process of Decision Support, and fits into the general process: Data Warehouse - Data Mining - Decision Support.

4 Data Mining in a Scientific Environment

While IBM may be driving Data Mining in the commercial marketplace, the origins are in fact in scientific computing with considerable work being done at UCLA and the University of Helsinki. Some of the original work was on Geophysical databases in an attempt to process some of the large volumes of data they have available.

What is not considered in much of the work on Data Mining is that most, if not all, of this work is just as applicable to the scientific environment. One of the critical issues with Data Mining is the need for a credibility check performed by someone who knows the field. Most scientists, and in particular experimentalists, have a great respect for their data, being well aware of the dangers of using inapplicable methods for analysis. An excellent example of this is given in Clifford Stoll's new book Silicon Snake Oil, in which he describes a study by an astronomer, Professor Li Fang, into the periodic motions of the earth's axis. This study involved the analysis of thousands of years of astronomical measurements. Dr Li had performed all the measurements by hand and Stoll was attempting to show him how easy it would have been with a computer. When Stoll presented his results, Dr Li replied:

When I compare the computer's results to my own, I see that an error has crept in. I suspect it is from the computer's assumption that our data is perfectly sampled throughout history. Such is not the case, especially during the Sung dynasty. And so, it may be that we need to analyze the data in a slightly different manner.

Having a computer, I had naturally cast the problem as simple data analysis. ... The real challenge was understanding the data and finding a good way to use it (Stoll 1995).

The underlying principle of the Scientific Method, the cycle of observation, hypothesis and experiment, fits well with the processes of Data Mining: discovery driven mining suits the observation-hypothesis step and verification driven mining the hypothesis-experiment step. As scientists have been working with this principle for centuries, and as most mathematics has been intended to support such scientific endeavours, many, if not all, of the methods are already being used by them. In many cases, the only change is in the terminology, not in the practice.

The final stage in any Data Mining is the presentation of results. In scientific work this has a very long history and is also an area of rapid change, ranging from the simple graphs that scientists have long studied through to the latest visualisation techniques being demonstrated on high performance graphics workstations.

5 Examples of Scientific Data Mining

One example of scientific Data Mining, drawn from farming and environmental research, is the optimisation of crop yield while minimising the resources supplied. To minimise the resources, it is necessary to identify which factors affect the crop yield, from among such items as chemical fertilisers and additives (for example, phosphate), the moisture content of the soil and the soil type.

One analysis looked at over 64 separate items measured over a number of years to extract those that were significant. Initially the analysis used discovery driven mining to find which parameters were significant, either by themselves or in conjunction with others. Using such statistical methods as multivariate regression analysis, the significant parameters and their relative influence were determined. From this, an equation was developed, which was then verified through verification driven mining against new datasets.
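The regression step can be sketched as an ordinary least squares fit. The following Python is a minimal illustration under stated assumptions, not the analysis actually performed: it uses three hypothetical predictors (`phosphate`, `moisture`, `soil_ph`) rather than 64, the data is synthetic and constructed so that the third predictor has no influence, and the significance tests a real study would apply are omitted. The function name `fit_linear` is also hypothetical.

```python
def fit_linear(rows, targets):
    """Ordinary least squares; returns [intercept, b1, b2, ...]."""
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])

    # Normal equations (X^T X) beta = X^T y, held as an augmented matrix.
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic data: (phosphate, moisture, soil_ph), invented for this sketch.
rows = [
    (10, 20, 6.0), (20, 25, 5.2), (30, 15, 6.8),
    (40, 30, 5.5), (50, 10, 7.1), (60, 35, 6.3),
]
# The yield is built from the first two predictors only.
targets = [2 + 0.5 * p + 0.1 * m for p, m, _ in rows]

intercept, b_phosphate, b_moisture, b_ph = fit_linear(rows, targets)
print(f"intercept={intercept:.3f} phosphate={b_phosphate:.3f} "
      f"moisture={b_moisture:.3f} soil_ph={b_ph:.3f}")
```

A coefficient near zero, as for `soil_ph` here, marks a candidate for removal. With real measurements the coefficients would have to be weighed against their standard errors and the scale of each predictor before drawing that conclusion.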

Of more general interest, the study of global climate change, a hot research area, is primarily a verification driven mining exercise. Climate data has been collected for many centuries and is being extended into the more distant past through such activities as the analysis of ice core samples from the Antarctic. At the same time, a number of different predictive models have been proposed for future climatic conditions. The sample data is used to verify these models by checking whether they accurately reproduce past conditions. The models are then further refined and used for another round of verification driven mining.

6 Conclusion

Data Mining is a new term and formalism for a process that has been undertaken by scientists for generations. The massive increase in the volume of data collected or generated for analysis with the use of computers has made it an essential tool. However, despite the more formal approach, Data Mining is something that scientists perform on an ad hoc basis and can easily adapt to. Many of the methods used for the analysis of the data were originally developed to process scientific data and are used unchanged.

As a final point, the biggest data source of all, the Internet, is becoming more and more important. While it contains useful information, extracting that information from the terabytes being added daily is an enormous task. The techniques of Data Mining are applicable here more than in any other domain. However, making use of them takes time, effort and, above all, people with a knowledge of the field who can differentiate the true solutions from the infeasible.

Bibliography

1
IBM (1995) Data Mining - An IBM Overview, IBM Almaden Research Centre.
URL: http://www.almaden.ibm.com/stss/papers/overview.html

2
Simoudis, E. (1995) Reality check for data mining, IBM Almaden Research Centre.
URL: http://www.almaden.ibm.com/stss/papers/reality/

3
UCLA Data Mining Laboratory (1996) UCLA Data Mining Laboratory Publications, UCLA.
URL: http://nugget.cs.ucla.edu:8001/publications/index.html

4
Data Mining Group at University of Helsinki (1996) Data Mining Group at University of Helsinki, Department of Computer Science, University of Helsinki.
URL: http://www.cs.Helsinki.FI/research/pmdm/datamining/

5
Stoll, C. (1995) Silicon Snake Oil - Second Thoughts on the Information Highway, Pan Books, London.

6
Baird, D. C. (1962) Experimentation: An Introduction to Measurement Theory and Experiment Design, Prentice Hall Inc., Englewood Cliffs, New Jersey.

7
Miller, B. (1996) Data Mining: Dealing with Data, in Proceedings of UniForum NZ '96, UniForum NZ.


Organised by: AUUG'96 & CSU