During the project we developed an extensible architecture that supports the integration of such heterogeneous biological data sets. There are three major obstacles in such an endeavour: the use of different identifiers for the same biological entities, the diversity of the data models underpinning the biological data, and the requirement to keep the integrated data warehouse current in the face of data and schema changes in the source data sets. In our architecture, entities are categorised into clusters, allowing individual biological entities to be annotated with family-based data. For example, sequence-based clustering enables gene-family-based annotation of individual sequences.
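The cluster-based annotation idea can be illustrated with a minimal sketch. It is not the project's implementation: the identifiers, cluster names, annotation fields, and the functions `members_by_cluster` and `annotate` are all hypothetical and chosen only to show how entities that carry different source identifiers can share family-level data through a common cluster.

```python
from collections import defaultdict

# Hypothetical data, invented for illustration only.
# Each source-specific identifier is assigned to a cluster (e.g. a gene family).
cluster_of = {
    "srcA:P001": "family_1",
    "srcB:X17": "family_1",   # a different source identifier in the same family
    "srcB:X42": "family_2",
}

# Family-level annotation, stored once per cluster rather than per entity.
family_annotation = {
    "family_1": {"fold": "example-fold-a"},
    "family_2": {"fold": "example-fold-b"},
}


def members_by_cluster(cluster_of):
    """Group entity identifiers by the cluster they belong to."""
    groups = defaultdict(set)
    for entity_id, cluster_id in cluster_of.items():
        groups[cluster_id].add(entity_id)
    return dict(groups)


def annotate(entity_id, cluster_of, family_annotation):
    """Return the family-level annotation inherited by one entity.

    Entities that are not assigned to any cluster get an empty annotation.
    """
    cluster_id = cluster_of.get(entity_id)
    if cluster_id is None:
        return {}
    return dict(family_annotation.get(cluster_id, {}))
```

In this sketch, `annotate("srcB:X17", cluster_of, family_annotation)` returns the same family data as the call for `"srcA:P001"`, because both identifiers map to the same cluster.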
We use the AutoMed data integration toolkit to store the schemas of the data sources and also the transformations from the source data into the data of the integrated warehouse. These transformations are generated semi-automatically by a process of schema matching and schema restructuring. The transformations can be used to update the warehouse data as entities change, are added, or are deleted in the data sources. The transformations can also be used to support the addition or removal of entire data sources, or evolutions in the schemas of the data sources or of the warehouse itself.
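As a rough illustration of how a stored transformation might be reused to keep the warehouse current, the sketch below applies a set of source changes to an in-memory warehouse. It does not use the AutoMed API: the `transform` function, the field names (`acc`, `label`, `entity_id`, `name`), and the shape of the `delta` dictionary are assumptions made for this example.

```python
def transform(source_row):
    """Hypothetical source-to-warehouse transformation (field renaming and cleanup)."""
    return {
        "entity_id": source_row["acc"],
        "name": source_row["label"].strip(),
    }


def apply_delta(warehouse, delta, transform):
    """Apply added, changed, and deleted source entities to the warehouse.

    `warehouse` maps entity_id -> warehouse record and is updated in place.
    `delta` holds lists under the assumed keys "added", "changed", "deleted";
    deleted entities are given by source accession, which this sketch
    assumes is also the warehouse key.
    """
    for row in delta.get("added", []) + delta.get("changed", []):
        record = transform(row)
        warehouse[record["entity_id"]] = record   # insert or overwrite
    for acc in delta.get("deleted", []):
        warehouse.pop(acc, None)                  # ignore ids already absent
    return warehouse
```

Because only the changed entities pass through the transformation, the warehouse does not have to be rebuilt from scratch after each source release, which is the property the incremental update relies on.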
Further, we have developed mechanisms supporting the transfer and incremental update
of the MSD database at remote sites. These mechanisms have been implemented
successfully at the Birkbeck/UCL Bloomsbury sites, and represent the first
implementation of the MSD database and its incremental update mechanisms at sites
outside the EBI.
Cluster based integration of heterogeneous biological databases using the AutoMed toolkit. M. Maibaum, L. Zamboulis, G. Rimon, C. Orengo, N. Martin, A. Poulovassilis. Proc. 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), pp. 191-207, 2005.