Integrating Transcriptomics and Structural Data to Reveal Protein Functions

Funding and Staffing Details

This project, funded by the Wellcome Trust, is a collaboration between computer scientists, bioinformaticians, biologists and experimental scientists from several departments at University College London, Birkbeck, the Institute of Child Health, Brunel University and the EBI. The investigator at Birkbeck is Nigel Martin in the School of Computer Science and Information Systems. Galia Rimon is a research fellow working on the project at Birkbeck.

Project Aims

The success of the international genome projects, and of the human genome project in particular, is a major landmark in the battle against disease. Post-genomic analysis now aims to describe where and when genes are expressed, and how they function in normal and diseased states. The advent of gene microarrays has contributed enormously to our ability to address these questions. It raises the exciting possibility of identifying transcriptional "fingerprints" associated with biochemical pathways and processes, and with phenotypes in development, physiology and disease. However, DNA microarrays generate vast amounts of data, and interpreting this data to shed light on biological function is a major challenge.

While transcriptional analysis is booming, considerable protein family, function and pathway/process data is accumulating as a result of the structural and functional genomics initiatives. Structural data, in particular, can give profound insights into protein function and the nature of protein-protein interactions. Annotations can now be provided for nearly 50% of some genomes, and for above 80% of genes coding for enzymes and other proteins participating in biochemical pathways. In this project we therefore propose to use extensive structural data to provide prior knowledge on gene function.

The development of successful strategies for mining transcriptomics data will critically depend on integrating the expression data with this protein family/structure/function data. This data will provide crucial prior knowledge on gene function, both for guiding and interpreting the clustering of co-expressed genes and for modelling temporal events in gene expression patterns.

In this project we will develop an integrated data warehouse - BioMap - containing protein family data, i.e. sequence, structure, function and pathway/process data, integrated with the gene expression and other experimental data. The data warehouse will underpin the data mining protocols that we will develop, which use prior knowledge of protein families and functions to facilitate the analysis of co-expressed genes. We will also develop methods for data visualisation, especially for pathways/processes and interacting proteins suggested by the data mining.
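As a purely illustrative sketch of the kind of integration described above, the following Python fragment joins gene expression profiles with protein family annotations on a shared gene identifier and then summarises expression per family. It is not the BioMap schema or the project's mining protocol: the gene names, family labels, expression values and the functions integrate and mean_profile_by_family are all hypothetical, chosen only to show how prior family knowledge can be attached to expression data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical protein family annotations keyed by gene identifier.
family_annotations = {
    "geneA": {"family": "kinase", "pathway": "signalling"},
    "geneB": {"family": "kinase", "pathway": "signalling"},
    "geneC": {"family": "dehydrogenase", "pathway": "glycolysis"},
}

# Hypothetical expression profiles: one value per experimental condition.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 4.0, 5.0],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [0.0, 1.0, 0.0],  # no annotation available
}


def integrate(annotations, profiles):
    """Join expression profiles with annotations on the gene identifier.

    Genes without an annotation are kept, with family and pathway set to None.
    """
    rows = []
    for gene, profile in sorted(profiles.items()):
        ann = annotations.get(gene, {})
        rows.append({
            "gene": gene,
            "family": ann.get("family"),
            "pathway": ann.get("pathway"),
            "profile": profile,
        })
    return rows


def mean_profile_by_family(rows):
    """Average the expression profiles of annotated genes within each family."""
    grouped = defaultdict(list)
    for row in rows:
        if row["family"] is not None:
            grouped[row["family"]].append(row["profile"])
    return {
        family: [mean(values) for values in zip(*profiles)]
        for family, profiles in grouped.items()
    }


rows = integrate(family_annotations, expression)
family_means = mean_profile_by_family(rows)
```

In this toy setting, a family whose members share a similar profile is a candidate group of co-expressed genes, while unannotated genes such as the hypothetical geneD remain in the integrated rows for later analysis.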

The focus at Birkbeck is the construction of the integrated data warehouse. Two significant Computer Science challenges arise: first, the modelling and integration of functional data within the data warehouse and, second, the development of data warehousing techniques to support data mining that exploits changing domain knowledge.

Project Publications

BioMap: Gene Family based Integration of Heterogeneous Biological Databases using AutoMed Metadata. M. Maibaum, G. Rimon, C. Orengo, N. Martin, A. Poulovassilis. Proc. 15th International Workshop on Database and Expert Systems Applications, DEXA 2004, 384-388 (2004).

Cluster based integration of Heterogeneous Biological Databases using the AutoMed toolkit. M. Maibaum, L. Zamboulis, G. Rimon, C. Orengo, N. Martin, A. Poulovassilis. Proc. 2nd International Workshop on Data Integration in the Life Sciences, DILS 2005, 191-207 (2005).