Personal
You have stumbled upon my hyperhome, welcome! As I am a scientist, this page focuses on my work. I currently lead a research group at the National e-Science Centre in the School of Informatics of the University of Edinburgh. If you are not into science, you might enjoy some of my photographs or you could take a look at some of my current or past projects.
Latest news
In this paper we address two research questions concerning workflows: 1) how do we abstract and catalogue recurring workflow patterns? and 2) how do we facilitate optimisation of the mapping from workflow patterns to actual resources at runtime? Similar questions have previously been investigated in the context of optimising compilers; our aim here is to explore techniques that are applicable to large-scale workflow compositions, where the resources could change dynamically during the lifetime of an application. We achieve this by introducing a mechanism for pattern abstraction and cataloguing, supported by runtime semantic bindings that are conditional on the execution parameters. Using a data mining application from the life sciences, we demonstrate this new approach experimentally.
Whether the goal is performance prediction, or insights into the relationships between algorithm performance and instance characteristics, a comprehensive set of meta-data from which relationships can be learned is needed. This paper provides a methodology to determine if the meta-data is sufficient, and demonstrates the critical role played by instance generation methods. Instances of the Travelling Salesman Problem (TSP) are evolved using an evolutionary algorithm to produce distinct classes of instances that are intentionally easy or hard for certain algorithms. A comprehensive set of features is used to characterise instances of the TSP, and the impact of these features on difficulty for each algorithm is analysed. Finally, performance predictions are achieved with high accuracy on unseen instances for predicting search effort as well as identifying the algorithm likely to perform best.
BACKGROUND: Microarray technology is a popular means of producing whole genome transcriptional profiles; however, the high cost and scarcity of mRNA have led many studies to be conducted based on the analysis of single samples. We exploit the design of the Illumina platform, specifically multiple arrays on each chip, to evaluate intra-experiment technical variation using repeated hybridisations of universal human reference RNA (UHRR) and duplicate hybridisations of primary breast tumour samples from a clinical study. RESULTS: A clear batch-specific bias was detected in the measured expressions of both the UHRR and clinical samples. This bias was found to persist following standard microarray normalisation techniques. However, when mean-centering or empirical Bayes batch-correction methods (ComBat) were applied to the data, inter-batch variation in the UHRR and clinical samples was greatly reduced. Correlation between replicate UHRR samples improved by two orders of magnitude following batch-correction using ComBat (ranging from 0.9833-0.9991 to 0.9997-0.9999), and the consistency of the gene-lists from the duplicate clinical samples increased from 11.6% in quantile normalised data to 66.4% in batch-corrected data. The use of UHRR as an inter-batch calibrator provided a small additional benefit when used in conjunction with ComBat, further increasing the agreement between the two gene-lists, up to 74.1%. CONCLUSION: In the interests of practicalities and cost, these results suggest that single samples can generate reliable data, but only after careful compensation for technical bias in the experiment. We recommend that investigators appreciate the propensity for such variation in the design stages of a microarray experiment and that the use of suitable correction methods become routine during the statistical analysis of the data.
Objective: Medical imaging acquired for clinical purposes can have several legitimate secondary uses in research projects and teaching libraries. No commonly accepted solution for anonymising these images exists because the amount of personal data that should be preserved varies case by case. Our objective is to provide a flexible mechanism for anonymising Digital Imaging and Communications in Medicine (DICOM) data that meets the requirements for deployment in multicentre trials. Methods: We reviewed our current de-identification practices and defined the relevant use cases to extract the requirements for the de-identification process. We then used these requirements in the design and implementation of the toolkit. Finally, we tested the toolkit against those requirements, including in a multicentre deployment. Results: The toolkit successfully anonymised DICOM data from various sources. Furthermore, it was shown that it could forward anonymous data to remote destinations, remove burned-in annotations, and add tracking information to the header. The toolkit also implements the DICOM standard confidentiality mechanism. Conclusion: A DICOM de-identification toolkit that facilitates the enforcement of privacy policies was developed. It is highly extensible, provides the necessary flexibility to account for different de-identification requirements and has a low adoption barrier for new users.
In this study, we apply a methodology for rapid development of portlets for scientific computing to the domain of computational chemistry. We report results in terms of the portals delivered, the changes made to our methodology and the experience gained in terms of interaction with domain specialists. Our major contributions are: several web portals for teaching and research in computational chemistry; a successful transition to having our development tool used by the domain specialist rather than by us, the developers; and an updated version of our methodology and technology for rapid development of portlets for computational science, which is free for anyone to pick up and use.
Developmental Gene Expression Map (DGEMap) is an EU-funded Design Study, which will accelerate an integrated European approach to gene expression in early human development. As part of this design study, we have had to address the challenges and issues raised by the long-term curation of such a resource. As this project is primarily one of data creators learning about curation, we have been looking at some of the models and tools that are already available in the digital curation field in order to inform our thinking on how we should proceed with curating DGEMap. This has led us to uncover a wide range of resources for data creators and curators alike. Here we will discuss the future curation of DGEMap as a case study. We believe our experience could be instructive to other projects looking to improve the curation and management of their data.
Manual annotation of biological data cannot keep up with data production. Open annotation models using wikis have been proposed to address this problem. In this empirical study we analyse 36 years of knowledge collection by 738 authors in two Molecular Biology wikis (EcoliWiki and WikiPathways) and two knowledge bases (OMIM and Reactome). We first investigate authorship metrics (authors per entry and edits per author) which are power-law distributed in Wikipedia and we find they are heavy-tailed in these four systems too. We also find surprising similarities between the open (editing open to everyone) and the closed systems (expert curators only). Secondly, to discriminate between driving forces in the measured distributions, we simulate the curation process and find that knowledge overlap among authors can drive the number of authors per entry, while the time the users spend on the knowledge base can drive the number of contributions per author.
Anyone who is purchasing a flight using a web browser expects to be guided through this task: from choosing the possible routes, to finding suitable dates and times, and to paying with a credit card. Today, most researchers from any discipline will successfully use these web-based e-commerce systems to book flights to attend their conferences. When these researchers are then confronted with solving compute-intensive problems, they cannot expect such elaborate web-based systems to enable their domain-specific tasks. Instead, they will have to deal with archaic command-line tools, and in the best case they may have access to generic portals that mimic the technical complexity of the underlying infrastructure. These interfaces are expensive to use as they require much investment from the researchers in terms of training. Moreover, the laborious and intricate processes involved often lead to errors.
Motivation: Scientific web portals are seen as the way forward to improve upon the slow uptake in use of utility computing infrastructure and high-performance computing facilities. Currently, two types of portals exist: general-purpose portals and domain-specific portals. The first type closely resembles the underlying technical infrastructure of compute-job submission systems, thereby providing little appeal to a wide range of domain specialists. The second type is tailored to the application specifications and their end-users' requirements. Unfortunately, the technical complexity in domain-specific portals makes these expensive and time-consuming to develop and maintain. Clearly, an alternative to these two approaches is required. Results: We introduce an approach, Rapid, that facilitates rapid development of portlets. Its main aim is to reduce the time from development to deployment from several months to a few weeks. Moreover, it provides an easy way for domain specialists themselves to share and maintain these portlets. Both these advantages considerably reduce the cost of developing portal solutions for computational science applications. We highlight several scientific domains where our approach is or was used successfully. Availability: Rapid is developed under an Open Source model and is freely available under the GNU General Public License. Main releases, documentation, tutorials and examples are available at http://research.nesc.ac.uk/rapid/. The development of Rapid uses an open read-only CVS repository, which is complemented by a developer community site at http://forge.nesc.ac.uk/projects/jos/.
The topic "Portals for Life Sciences" includes various research fields: on the one hand, many different topics from the life sciences, e.g. mass spectrometry; on the other hand, portal technologies and different aspects of computer science, such as usability of user interfaces and security of systems. The main purpose of portals is to simplify the user's interaction with computational resources which are concerted to a supported application domain.
IWPLS'09 focuses on research contributions for portals and tools in the field of life sciences. It brings together scientists from the fields of life science, bioinformatics and computer science. Its aim is to become the international platform to exchange experience, formulate ideas, and catch up on technological advances in molecular and systems biology in the context of portals. All papers published in these proceedings were accepted through a peer-reviewing process. Each paper had a 30-minute presentation and each accepted abstract had a lightning talk of 10 minutes. We would like to thank the authors for their contributions and our Program Committee for the effort put into reviewing. Nine papers were selected out of the excellent submissions. We owe much gratitude to the local organisers, for without their hard work the workshop would not have been such a success. We acknowledge both the e-Science Institute in Edinburgh and the Scottish Bioinformatics Forum for their financial contributions.