Jano's Homepage

Jano van Hemert


 
Publications » e-Science


Federated Enactment of Workflow Patterns
inproceedings Gagarine Yaikhom and Liew, Chee Sun and Liangxiu Han and van Hemert, Jano and Malcolm Atkinson and Amy Krause @ 2010/07/12
Euro-Par, pages 317-328.

In this paper we address two research questions concerning workflows: 1) how do we abstract and catalogue recurring workflow patterns? and 2) how do we facilitate optimisation of the mapping from workflow patterns to actual resources at runtime? Similar questions have been previously investigated in the context of optimising compilers: our aim here is to explore techniques that are applicable to large-scale workflow compositions, where the resources could change dynamically during the life-time of an application. We achieve this by introducing a mechanism for pattern abstraction and cataloguing, supported by runtime semantic bindings that are conditional on the execution parameters. Using a data mining application from the life sciences, we demonstrate this new approach experimentally.



Correcting for intra-experiment variation in Illumina BeadChip data is necessary to generate robust gene-expression profiles
article Kitchen, Robert and Sabine, Vicky and Sims, Andrew and Macaskill, E Jane and Renshaw, Lorna and Thomas, Jeremy and van Hemert, Jano and Dixon, J Michael and Bartlett, John @ 2010/03/25
BMC Genomics, 11(1), 2010, pages 134.

BACKGROUND: Microarray technology is a popular means of producing whole genome transcriptional profiles, however high cost and scarcity of mRNA has led many studies to be conducted based on the analysis of single samples. We exploit the design of the Illumina platform, specifically multiple arrays on each chip, to evaluate intra-experiment technical variation using repeated hybridisations of universal human reference RNA (UHRR) and duplicate hybridisations of primary breast tumour samples from a clinical study. RESULTS: A clear batch-specific bias was detected in the measured expressions of both the UHRR and clinical samples. This bias was found to persist following standard microarray normalisation techniques. However, when mean-centering or empirical Bayes batch-correction methods (ComBat) were applied to the data, inter-batch variation in the UHRR and clinical samples were greatly reduced. Correlation between replicate UHRR samples improved by two orders of magnitude following batch-correction using ComBat (ranging from 0.9833-0.9991 to 0.9997-0.9999) and increased the consistency of the gene-lists from the duplicate clinical samples, from 11.6% in quantile normalised data to 66.4% in batch-corrected data. The use of UHRR as an inter-batch calibrator provided a small additional benefit when used in conjunction with ComBat, further increasing the agreement between the two gene-lists, up to 74.1%. CONCLUSION: In the interests of practicalities and cost, these results suggest that single samples can generate reliable data, but only after careful compensation for technical bias in the experiment. We recommend that investigators appreciate the propensity for such variation in the design stages of a microarray experiment and that the use of suitable correction methods become routine during the statistical analysis of the data.



Molecular Orbital Calculations of Inorganic Compounds
incollection Morrison, C.A. and Robertson, N. and Turner, A. and van Hemert, J. and Koetsier, J. @ 2010/03/25
Inorganic Experiments, pages 261-267.


An Open Source Toolkit for Medical Imaging De-Identification
article Rodríguez González, D. and T. Carpenter and van Hemert, J.I. and J. Wardlaw @ 2010/01/05
European Radiology, 20(8), 2010, pages 1896-1904.

OBJECTIVE: Medical imaging acquired for clinical purposes can have several legitimate secondary uses in research projects and teaching libraries. No commonly accepted solution for anonymising these images exists because the amount of personal data that should be preserved varies case by case. Our objective is to provide a flexible mechanism for anonymising Digital Imaging and Communications in Medicine (DICOM) data that meets the requirements for deployment in multicentre trials. METHODS: We reviewed our current de-identification practices and defined the relevant use cases to extract the requirements for the de-identification process. We then used these requirements in the design and implementation of the toolkit. Finally, we tested the toolkit taking as a reference those requirements, including a multicentre deployment. RESULTS: The toolkit successfully anonymised DICOM data from various sources. Furthermore, it was shown that it could forward anonymous data to remote destinations, remove burned-in annotations, and add tracking information to the header. The toolkit also implements the DICOM standard confidentiality mechanism. CONCLUSION: A DICOM de-identification toolkit that facilitates the enforcement of privacy policies was developed. It is highly extensible, provides the necessary flexibility to account for different de-identification requirements and has a low adoption barrier for new users.



Rapid chemistry portals through engaging researchers
inproceedings Koetsier, J. and Turner, A. and Richardson, P. and van Hemert, J.I. @ 2009/12/08
Fifth IEEE International Conference on e-Science, pages 284-291.

In this study, we apply a methodology for rapid development of portlets for scientific computing to the domain of computational chemistry. We report results in terms of the portals delivered, the changes made to our methodology and the experience gained in terms of interaction with domain specialists. Our major contributions are: several web portals for teaching and research in computational chemistry; a successful transition to having our development tool used by the domain specialist rather than by us, the developers; and an updated version of our methodology and technology for rapid development of portlets for computational science, which is free for anyone to pick up and use.



Using the DCC Lifecycle Model to Curate a Gene Expression Database: A Case Study
article J. O'Donoghue and van Hemert, J.I. @ 2009/12/01
International Journal of Digital Curation, 4(3), 2009.

Developmental Gene Expression Map (DGEMap) is an EU-funded Design Study, which will accelerate an integrated European approach to gene expression in early human development. As part of this design study, we have had to address the challenges and issues raised by the long-term curation of such a resource. As this project is primarily one of data creators learning about curation, we have been looking at some of the models and tools that are already available in the digital curation field in order to inform our thinking on how we should proceed with curating DGEMap. This has led us to uncover a wide range of resources for data creators and curators alike. Here we will discuss the future curation of DGEMap as a case study. We believe our experience could be instructive to other projects looking to improve the curation and management of their data.



Giving Computational Science a Friendly Face
article van Hemert, J.I. and Koetsier, J. @ 2009/10/01
Zero-In, 1(3), 2009, pages 12-13.

Anyone who is purchasing a flight using a web browser expects to be guided through this task: from choosing the possible routes, to finding suitable dates and times, and to paying with a credit card. Today, most researchers from any discipline will successfully use these web-based e-commerce systems to book flights to attend their conferences. When these researchers are then confronted with solving compute-intensive problems, they need not expect such elaborate web-based systems to enable their domain-specific tasks. Instead, they will have to deal with archaic command-line tools and in the best case they may have access to generic portals that mimic the technical complexity of the underlying infrastructure. These interfaces are expensive to use as they require much investment from the researchers in terms of training. Moreover, the laborious and intricate processes involved often lead to errors.



Proceedings of the IWPLS09 International Workshop on Portals for Life Sciences
proceedings Gesing, S. and van Hemert, J.I. @ 2009/09/09, Edinburgh, UK
Proceedings of the International Workshop on Portals for Life Sciences.

IWPLS'09 focuses on research contributions for portals and tools in the field of life sciences. It brings together scientists from the fields of life science, bioinformatics and computer science. Its aim is to become the international platform to exchange experience, formulate ideas, and catch up on technological advances in molecular and systems biology in the context of portals. All papers published in these proceedings were accepted through a peer-reviewing process. Each paper had a 30-minute presentation and each accepted abstract had a lightning talk of 10 minutes. We would like to thank the authors for their contributions and our Program Committee for the effort put into reviewing. Nine papers were selected out of the excellent submissions. We owe much gratitude to the local organisers, for without their hard work the workshop would not have been such a success. We acknowledge both the e-Science Institute in Edinburgh and the Scottish Bioinformatics Forum for their financial contributions.



An E-infrastructure to Support Collaborative Embryo Research
inproceedings A. Barker and van Hemert, J.I. and R.A. Baldock and M.P. Atkinson @ 2009/05/22
Cluster Computing and the Grid, pages 520-525.

Within the context of the EU Design Study Developmental Gene Expression Map, we identify a set of challenges when facilitating collaborative research on early human embryo development. These challenges bring forth requirements, for which we have identified solutions and technology. We summarise our solutions and demonstrate how they integrate to form an e-infrastructure to support collaborative research in this area of developmental biology.



Towards a Virtual Fly Brain
article Armstrong, J.D. and van Hemert, J.I. @ 2009/03/01
Philosophical Transactions A, 367(1896), 2009, pages 2387-2397.

Models of the brain that simulate sensory input, behavioural output and information processing in a biologically plausible manner pose significant challenges to both Computer Science and Biology. Here we investigated strategies that could be used to create a model of the insect brain, specifically that of Drosophila melanogaster which is very widely used in laboratory research. The scale of the problem is an order of magnitude above the most complex of the current simulation projects and it is further constrained by the relative sparsity of available electrophysiological recordings from the fly nervous system. However, fly brain research at the anatomical and behavioural level offers some interesting opportunities that could be exploited to create a functional simulation. We propose to exploit these strengths of Drosophila CNS research to focus on a functional model that maps biologically plausible network architecture onto phenotypic data from neuronal inhibition and stimulation studies, leaving aside biophysical modelling of individual neuronal activity for future models until more data is available.



The Circulate Architecture: Avoiding Workflow Bottlenecks Caused By Centralised Orchestration
article Barker, A. and Weissman, J. and van Hemert, J.I. @ 2009/03/01
Cluster Computing, 12(2), 2009, pages 221-235.

As the number of services and the size of data involved in workflows increase, centralised orchestration techniques are reaching the limits of scalability. In the classic orchestration model, all data passes through a centralised engine, which results in unnecessary data transfer, wasted bandwidth and the engine becoming a bottleneck to the execution of a workflow. This paper presents and evaluates the Circulate architecture, which maintains the robustness and simplicity of centralised orchestration but facilitates choreography by allowing services to exchange data directly with one another. Circulate could be realised within any existing workflow framework; in this paper, we focus on WS-Circulate, a Web services based implementation. Taking inspiration from the Montage workflow, a number of common workflow patterns (sequence, fan-in and fan-out), input to output data size relationships and network configurations are identified and evaluated. The performance analysis concludes that a substantial reduction in communication overhead results in a 2-4 fold performance benefit across all patterns. An end-to-end pattern through the Montage workflow results in an 8 fold performance benefit and demonstrates how the advantage of using the Circulate architecture increases as the complexity of a workflow grows.



Matching Spatial Regions with Combinations of Interacting Gene Expression Patterns
inproceedings van Hemert, J.I. and R.A. Baldock @ 2008/07/07
Proceedings of the 2nd International Conference on BioInformatics Research and Development, pages 347-361.

The Edinburgh Mouse Atlas aims to capture in-situ gene expression patterns in a common spatial framework. In this study, we construct a grammar to define spatial regions by combinations of these patterns. Combinations are formed by applying operators to curated gene expression patterns from the atlas, thereby resembling gene interactions in a spatial context. The space of combinations is searched using an evolutionary algorithm with the objective of finding the best match to a given target pattern. We evaluate the method by testing its robustness and the statistical significance of the results it finds.



Scientific Workflow: A Survey and Research Directions
inproceedings Barker, A. and van Hemert, J.I. @ 2008/05/29
Parallel Processing and Applied Mathematics, pages 746-753.

Workflow technologies are emerging as the dominant approach to coordinate groups of distributed services. However, with a space filled with competing specifications, standards and frameworks from multiple domains, choosing the right tool for the job is not always a straightforward task. Researchers are often unaware of the range of technology that already exists and focus on implementing yet another proprietary workflow system. As an antidote to this common problem, this paper presents a concise survey of existing workflow technology from the business and scientific domains and makes a number of key suggestions towards the future development of scientific workflow systems.



Data Integration in eHealth: A Domain/Disease Specific Roadmap
inproceedings J. Ure and R. Proctor and M. Martone and D. Porteous and S. Lloyd and S. Lawrie and D. Job and R. Baldock and A. Philp and D. Liewald and F. Rakebrand and A. Blaikie and C. McKay and S. Anderson and J. Ainsworth and van Hemert, J. and I. Blanquer and R. Sinnott and C. Barillot and F. Bernard Gibaud and A. Williams and M. Hartswood and P. Watson and L. Smith and A. Burger and J. Kennedy and H. Gonzalez-Velez and R. Stevens and O. Coecho and R. Morton and P. Linksted and M. Deschenne and M. McGilchrist and P. Johnson and A. Voss and R. Gertz and J. Wardlaw @ 2007/04/27
Studies in Health Technology and Informatics, pages 144-153.

The paper documents a series of data integration workshops held in 2006 at the UK National e-Science Centre, summarizing a range of the problem/solution scenarios in multi-site and multi-scale data integration with six HealthGrid projects using schizophrenia as a domain-specific test case. It outlines emerging strategies, recommendations and objectives for collaboration on shared ontology-building and harmonization of data for multi-site trials in this domain.



Mining spatial gene expression data for association rules
inproceedings van Hemert, J.I. and R.A. Baldock @ 2007/03/12
Proceedings of the 1st International Conference on BioInformatics Research and Development, pages 66-76.

We analyse data from the Edinburgh Mouse Atlas Gene-Expression Database (EMAGE), which is a high-quality data source for spatio-temporal gene expression patterns. Using a novel process whereby generated patterns are used to probe spatially-mapped gene expression domains, we are able to get unbiased results as opposed to using annotations based on predefined anatomy regions. We describe two processes to form association rules based on spatial configurations: one associates spatial regions, the other associates genes.