Phylogenetic trees, also known as phylogenies, represent the evolutionary history of sets of species. The construction of such trees is an attempt to understand the origin of life. Phylogenies have several practical uses. For instance, they offer biologists a tool to predict gene function, by comparing and leveraging information among species related by evolution. They also help to track changes in rapidly developing organisms such as viruses or cancer cells.
Building phylogenetic trees from gene data poses major computational problems. Every known formulation of the task leads to an intractable (NP-hard) problem. Further, data is scarce: not every gene has been sequenced for all species, and not every species has the same set of genes. Thus, to build comprehensive phylogenies, we must piece together information from different genes. Complicating matters even further, gene evolution does not always conform to the history of speciation. In fact, due to gene duplication and loss, and horizontal gene transfer, it is not rare for gene phylogenies to be in conflict with each other.
The proposed research addresses the mathematical and algorithmic problems that arise from data scarcity and conflict in phylogenetics. Among the interrelated themes we will explore are the detection of conflict, the properties of dissimilarity measures between trees, and the identification of common patterns in collections of trees. A portion of our work will center on the supertree problem, where the goal is to amalgamate a collection of trees for partially overlapping sets of species into a single phylogeny -a supertree- for the combined set of species. We will investigate the properties a supertree method can and cannot satisfy, and develop mathematical characterizations of the solvable instances of important classes of supertree problems. We will use the latter to identify parameterizations under which certain problems become tractable. Our methods for mining phylogenetic databases for common patterns will enable biologists to identify well-supported evolutionary relationships among species and to spot "rogue taxa"; i.e., species whose presence radically alters the result of a phylogenetic analysis. We will also investigate supertree construction and data mining for multi-labeled trees (trees where each species can occur multiple times), which are now commonplace in phylogenetic databases.