Publications » Data Mining
Performance is an open issue in data-intensive applications. Finding the best implementation and the influential performance factors of hardware and software platforms for data-intensive applications requires trial and error. However, it is very difficult and costly to perform these trials in a real large-scale environment. In this paper we use a generic simulation framework, SIMCAN, to simulate hardware and software platforms of data-intensive applications in order to investigate the influential performance factors and thereby make decisions on the design of data-intensive application architectures. We have employed a typical use case of a data mining application, in which the architecture has been proposed using a pipeline model. We have simulated various scenarios to investigate factors that affect system performance to assist the architecture design, and the simulation results provide useful information for this decision-making.
This paper presents the rationale for a new architecture to support a significant increase in the scale of data integration and data mining. It proposes the composition into one framework of (1) data mining and (2) data access and integration. We name the combined activity DMI. It supports enactment of DMI processes across heterogeneous and distributed data resources and data mining services. It posits that a useful division can be made between the facilities established to support the definition of DMI processes and the computational infrastructure provided to enact DMI processes. Communication between those two divisions is restricted to requests submitted to gateway services in a canonical DMI language. Larger-scale processes are enabled by incremental refinement of DMI-process definitions often by recomposition of lower-level definitions. Autonomous evolution of data resources and services is supported by types and descriptions which will support detection of inconsistencies and semi-automatic insertion of adaptations. These architectural ideas are being evaluated in a feasibility study that involves an application scenario and representatives of the community.
Within the context of the EU Design Study Developmental Gene Expression Map, we identify a set of challenges when facilitating collaborative research on early human embryo development. These challenges bring forth requirements, for which we have identified solutions and technology. We summarise our solutions and demonstrate how they integrate to form an e-infrastructure to support collaborative research in this area of developmental biology.
The wavelet transform has proven to be a powerful tool for characterizing network traffic. However, the resulting wavelet decomposition typically forms a high-dimensional space, which makes compact representation, visualization, and modeling of the data problematic. In this study, we employ data projection techniques to represent the high-dimensional wavelet decomposition of network traffic data in a low-dimensional space to facilitate the visual analysis of network traffic patterns. A low-dimensional representation can significantly reduce the model complexity, and the features of a traffic pattern can be presented with a small number of parameters. The experimental results show that the proposed method can effectively discriminate between different application flows, such as FTP and P2P data flows.
The Edinburgh Mouse Atlas aims to capture in-situ gene expression patterns in a common spatial framework. In this study, we construct a grammar to define spatial regions by combinations of these patterns. Combinations are formed by applying operators to curated gene expression patterns from the atlas, thereby resembling gene interactions in a spatial context. The space of combinations is searched using an evolutionary algorithm with the objective of finding the best match to a given target pattern. We evaluate the method by testing its robustness and the statistical significance of the results it finds.
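The search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the paper: patterns are modelled as sets of voxel indices, the operators `and`, `or` and `minus` are hypothetical stand-ins for the grammar's operators, Jaccard similarity is an assumed matching score, and the toy gene names `A`, `B`, `C` are invented.

```python
import random

# Assumed operators on patterns (sets of voxel indices); the paper's grammar may differ.
OPS = {
    "and": lambda a, b: a & b,    # both genes expressed
    "or": lambda a, b: a | b,     # either gene expressed
    "minus": lambda a, b: a - b,  # first gene expressed without the second
}


def evaluate(expr, patterns):
    """Evaluate an expression tree: a gene name or (operator, left, right)."""
    if isinstance(expr, str):
        return patterns[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, patterns), evaluate(right, patterns))


def jaccard(a, b):
    """Similarity of two patterns: |intersection| / |union| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def random_expr(rng, genes, depth):
    """Grow a random expression tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(genes)
    return (rng.choice(list(OPS)),
            random_expr(rng, genes, depth - 1),
            random_expr(rng, genes, depth - 1))


def mutate(rng, expr, genes):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if isinstance(expr, str) or rng.random() < 0.3:
        return random_expr(rng, genes, 1)
    op, left, right = expr
    if rng.random() < 0.5:
        return (op, mutate(rng, left, genes), right)
    return (op, left, mutate(rng, right, genes))


def search(patterns, target, generations=100, pop_size=20, seed=0):
    """Simple evolutionary search with truncation selection and mutation only."""
    rng = random.Random(seed)
    genes = sorted(patterns)

    def fitness(expr):
        return jaccard(evaluate(expr, patterns), target)

    population = [random_expr(rng, genes, 2) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = [mutate(rng, rng.choice(parents), genes)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)


# Toy one-dimensional "atlas" with three invented gene patterns.
patterns = {
    "A": frozenset(range(0, 6)),
    "B": frozenset(range(4, 10)),
    "C": frozenset(range(8, 12)),
}
target = frozenset({4, 5, 8, 9, 10, 11})  # equals (A and B) or C

best_expr, best_score = search(patterns, target)
print(best_expr, best_score)
```

Because the better half of the population is carried over unchanged, the best score never decreases from one generation to the next.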
We analyse data from the Edinburgh Mouse Atlas Gene-Expression Database (EMAGE), which is a high-quality data source for spatio-temporal gene expression patterns. Using a novel process whereby generated patterns are used to probe spatially mapped gene expression domains, we are able to obtain unbiased results, as opposed to using annotations based on predefined anatomical regions. We describe two processes to form association rules based on spatial configurations: one associates spatial regions, the other associates genes.
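As a rough illustration of what an association rule between genes looks like, the sketch below computes the standard support and confidence measures. It assumes, purely for illustration, that each probe pattern yields a "transaction" listing the genes expressed in it; the gene names and transactions are invented, and the paper's actual rule-forming processes may differ.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Hypothetical transactions: genes found expressed in each of four probe patterns.
transactions = [
    {"g1", "g2"},
    {"g1", "g2", "g3"},
    {"g1"},
    {"g2", "g3"},
]

rule_support = support({"g1", "g2"}, transactions)
rule_confidence = confidence({"g1"}, {"g2"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule g1 → g2 holds in two of the four transactions, and in two of the three transactions that contain g1.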
A limited number of hard copies is available; if you are interested, drop me an e-mail.
Contents (chapter level):
- 1. Introduction
- 2. Evolutionary Computation
- Part I: Constraint Satisfaction
- 3. Constraint Satisfaction problems
- 4. Solving Constraint Satisfaction Problems
- 5. Empirical Research on Constraint Satisfaction
- 6. Measuring the Resampling Ratio
- 7. Constraint Satisfaction: Conclusions
- Part II: Data Mining
- 8. Introduction
- 9. Classification
- 10. Symbolic Regression
- 11. Data Mining Conclusions
- 12. Dynamic Behaviour
- 13. Bridging the Gap
- A. RandomCSP Library
- B. Library for Evolutionary Algorithm Programming
- C. Case Study: Scheduling a Telescope

In this paper we continue our study on adaptive genetic programming. We use Stepwise Adaptation of Weights to boost performance of a genetic programming algorithm on simple symbolic regression problems. We measure the performance of a standard GP and two variants of SAW extensions on two different symbolic regression problems from literature. Also, we propose a model for randomly generating polynomials which we then use to further test all three GP variants.
In this paper we continue our study of the Stepwise Adaptation of Weights (SAW) technique. Previous studies on constraint satisfaction and data classification have indicated that SAW is a promising technique to boost the performance of evolutionary algorithms. Here we use SAW to boost the performance of a genetic programming algorithm on simple symbolic regression problems. We measure the performance of a standard GP and two variants of SAW extensions on two different symbolic regression problems.
This article is a combined summary of two papers written by the authors. Binary data classification problems (with exactly two disjoint classes) form an important application area of machine learning techniques, in particular genetic programming (GP). In this study we compare a number of different variants of GP applied to such problems, investigating the effect of two significant changes in a fixed GP setup in combination with two different evolutionary models.
In this paper we report the results of a comparative study on different variations of genetic programming applied to binary data classification problems. The first genetic programming variant weights data records when calculating the classification error and modifies the weights during the run. The algorithm thereby defines its own fitness function in an on-line fashion, giving higher weights to 'hard' records. Another novel feature we study is the atomic representation, where 'Booleanization' of data is not performed at the root but at the leaves of the trees, and only Boolean functions are used in the body of the trees. As a third aspect we look at generational and steady-state models in combination with both features.
In this paper we describe how the Stepwise Adaptation of Weights (SAW) technique can be applied in genetic programming. The SAW-ing mechanism was originally developed for, and successfully used in, EAs for constraint satisfaction problems. Here we identify the basic ideas underlying SAW-ing and point out how it can be used for different types of problems. In particular, SAW-ing is well suited for data mining tasks where the fitness of a candidate solution is composed of 'local scores' on data records. We evaluate the power of the SAW-ing mechanism on a number of benchmark classification data sets. The results indicate that extending the GP with the SAW-ing feature increases its performance when different types of misclassifications are not weighted differently, but leads to worse results when they are.
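The core idea of composing fitness from weighted local scores can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that every record starts with weight 1, that the fitness to be minimised is the summed weight of misclassified records, and that the weights of records misclassified by the current best individual are periodically raised by a fixed step. The names `weighted_error`, `saw_update` and `delta`, as well as the toy data, are hypothetical.

```python
def weighted_error(classifier, data, weights):
    """Fitness to minimise: summed weight of the records the classifier gets wrong."""
    return sum(w for (x, label), w in zip(data, weights) if classifier(x) != label)


def saw_update(best, data, weights, delta=1.0):
    """Return new weights, raised by `delta` for each record that `best` misclassifies."""
    return [w + delta if best(x) != label else w
            for (x, label), w in zip(data, weights)]


# Toy data set: one feature per record; the label is 1 when the feature exceeds 0.5.
data = [((0.1,), 0), ((0.4,), 0), ((0.6,), 1), ((0.9,), 1)]
weights = [1.0] * len(data)


def best(x):
    """Stand-in for the best individual of the current population (threshold too high)."""
    return 1 if x[0] > 0.7 else 0


# In a full GP run this update would be applied every few generations (assumed schedule).
weights = saw_update(best, data, weights)
print(weights, weighted_error(best, data, weights))
```

After the update, the record that the best individual gets wrong counts more heavily, so candidate solutions that classify it correctly gain a larger fitness advantage in subsequent generations.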
Supervised by A.E. Eiben and E. Marchiori