Mapping individual gene data on an evolutionary tree
B. Mirkin, T.I. Fenner, G. Loizou
Project outline and aims
Evolutionary trees are an important instrument in inter-genome analysis. Traditionally, computational biology has focused on the problem of tree building. This problem can be formulated as follows: given some data on a set of extant species, build a (rooted) tree whose leaves correspond to the extant species and whose interior nodes correspond to their ancestors, in such a way that more similar species are separated by later divergence events. This project is devoted to a related problem: developing methods for interpreting various types of data on the extant species by mapping them in a biologically meaningful way onto an evolutionary tree and annotating the tree nodes with the relevant evolutionary events.
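As an illustration of what annotating tree nodes with evolutionary events can mean in practice, the following minimal sketch maps the presence/absence pattern of a single gene onto a rooted tree by a standard two-state, Sankoff-style parsimony computation. It is not the project's own algorithm. The function name `map_gene`, the dictionary-based tree encoding, the cost parameters, and the convention that the gene is absent above the root (so that presence at the root is charged as one gain) are all assumptions made for this example.

```python
def map_gene(children, root, present, gain_cost=1.0, loss_cost=1.0):
    """Return (total_cost, events) for one gene's presence/absence pattern.

    children: dict mapping each interior node to a list of its children
    present:  dict mapping each leaf to True/False (gene observed or not)
    events:   list of (node, "gain" | "loss") placed on the edge entering node
    All names here are hypothetical; this is a sketch, not a reference method.
    """
    INF = float("inf")

    def step(parent_state, child_state):
        # Cost of the edge between a parent state and a child state (0/1).
        if parent_state == child_state:
            return 0.0
        return gain_cost if child_state == 1 else loss_cost

    # Pre-order listing: every node appears before its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Bottom-up pass: cost[node] = (best cost if absent, best cost if present).
    cost = {}
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            cost[node] = (INF, 0.0) if present[node] else (0.0, INF)
        else:
            cost[node] = tuple(
                sum(min(cost[k][t] + step(s, t) for t in (0, 1)) for k in kids)
                for s in (0, 1)
            )

    # Assumption: the gene is absent above the root.
    total = min(cost[root][t] + step(0, t) for t in (0, 1))

    # Top-down pass: pick the cheapest state per node and record the events.
    events, incoming = [], {root: 0}
    for node in order:
        s = incoming[node]
        t = min((0, 1), key=lambda t: cost[node][t] + step(s, t))
        if t != s:
            events.append((node, "gain" if t == 1 else "loss"))
        for k in children.get(node, []):
            incoming[k] = t
    return total, events


tree = {"root": ["AB", "CD"], "AB": ["A", "B"], "CD": ["C", "D"]}
pattern = {"A": True, "B": True, "C": False, "D": False}
print(map_gene(tree, "root", pattern))
```

With equal gain and loss costs, the gene in this toy tree is explained by a single gain on the edge leading to the ancestor of A and B. Raising `gain_cost` relative to `loss_cost` pushes gains towards the root and trades them for additional losses, which is the kind of trade-off a parsimonious scenario has to settle.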
In particular, we are concerned with three specific projects:
O. Eulenstein, B. Mirkin, and M. Vingron (1997) Comparison of annotating duplication, tree mapping, and copying as methods to compare gene trees with species trees, in B. Mirkin, F. McMorris, F. Roberts, and A. Rzhetsky (Eds.) Mathematical Hierarchies and Biology, DIMACS Series, V. 37, Providence: AMS, 71-94.
O. Eulenstein, B. Mirkin, and M. Vingron (1998) Duplication-based measures of difference between gene and species trees, Journal of Computational Biology, 5, 135-148.
B. Mirkin (2004) Mapping gene family data onto evolutionary trees, in M. Chavent, O. Dordan, C. Lacomblez, M. Langlais, and B. Patouille (Eds.), Comptes rendus des 11es Rencontres de la Societe Francophone de Classification, University of Bordeaux, 61-68.
B. Mirkin, T. Fenner, M. Galperin and E. Koonin (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes, BMC Evolutionary Biology, 3:2.
B. Mirkin and E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, V. 61, Providence: AMS, 97-112.
K.S. Makarova, Y.I. Wolf, S.L. Mekhedov, B. Mirkin and E.V. Koonin (2005) Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell, Nucleic Acids Research, 33(14), 4626-4638.
Two major discoveries are:
(1) Clear-cut cases in which an ancestral protein sequence has diverged so much in the course of evolution that no similarity remains between its descendants in the extant species; we were nevertheless able to establish their orthology computationally by matching (i) our reconstruction results against (ii) information on gene arrangement in the genomes.
(2) Surprisingly, in spite of considerable international efforts to determine the functions of viral proteins, the function of 85% of the ancestral genes fine-tuning the separation of the beta and gamma super-families from the rest remains unknown.
On the computational side, the homologous protein families (HPFs) of the VIDA database were updated by an extensive search through the major bioinformatics databases, in the course of which we resolved numerous inconsistencies between different submissions. Our annotation of the tree with the original VIDA HPFs showed that further aggregation of the HPFs was needed. We therefore had to develop a novel clustering method involving protein neighbourhoods, majority lists, data recovery clustering and a shift of the similarity scale. The shift, a crucial parameter, was adjusted iteratively against domain knowledge: first by using HPFs with known functions, then by comparing the reconstructed histories with gene arrangements in the genomes. We also developed maximum likelihood versions of our approach involving either node-specific or constant probabilities of loss/gain.
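To give a feel for why the similarity shift matters, here is a minimal sketch under a simplifying assumption: a partition is scored by the sum of shifted within-cluster similarities, so that pairs less similar than the shift count against putting them in the same cluster. The scoring function `within_cluster_score`, the toy similarity values and the two candidate partitions are hypothetical and are not taken from the project's data or code.

```python
def within_cluster_score(sim, clusters, shift):
    """Sum of (similarity - shift) over all unordered pairs inside clusters.

    sim:      symmetric matrix (list of lists) of pairwise similarities
    clusters: list of clusters, each a list of item indices
    shift:    value subtracted from every similarity before summing
    Hypothetical helper for illustration only.
    """
    total = 0.0
    for cluster in clusters:
        for a in range(len(cluster)):
            for b in range(a + 1, len(cluster)):
                total += sim[cluster[a]][cluster[b]] - shift
    return total


# Toy similarities between three proteins (made-up numbers).
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
merged = [[0, 1, 2]]   # everything in one family
split = [[0, 1], [2]]  # the weakly related protein kept apart

for shift in (0.0, 0.5):
    print(shift,
          within_cluster_score(sim, merged, shift),
          within_cluster_score(sim, split, shift))
```

Without a shift, the single merged cluster always scores at least as high, because every non-negative similarity adds to the sum. With a shift of 0.5, the weak links become negative and the split partition is preferred, which is why the choice of shift has to be tuned against external knowledge.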
Some materials:
B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2006) Aggregation of Homologous Protein Families (HPFs) for mapping them onto an evolutionary tree, MASAMB - Mathematical and Statistical Aspects of Molecular Biology, Dublin, April 2006.
B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2006) Aggregating Homologous Protein Families in evolutionary reconstructions of herpesviruses, 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, Toronto, September 2006, 255-263.
B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2007) Using domain knowledge and shift of origin in clustering similarity data (submitted).
Subjects for student projects
Extending the algorithm for parsimonious mapping of gain and loss events to unresolved evolutionary trees.
Using similarities between proteins for selecting an evolutionary scenario of a gene.
Finding and visualising evolutionary events for individual gene families.