Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University


Macromolecular Structure-Function Prediction

Personnel

Project Summary

Characterizing the structure and function of biological macromolecules (e.g., proteins) is a central problem in modern biology. Proteins are abundant in all organisms and are fundamental to life. The diversity of protein structure underlies the very large range of their functions. Proteins are linear heteropolymers of fixed length: a given protein always has the same number and composition of monomers, but different proteins vary in length from a few tens to approximately a thousand monomer units. The monomers are amino acids, of which there are 20 types with a range of chemical properties. There is therefore a great diversity of possible protein sequences. The linear chains fold into specific three-dimensional conformations, which are determined by the sequence of amino acids; proteins are generally self-folding. The three-dimensional structures of proteins are therefore also extremely diverse, ranging from completely fibrous to globular.

Protein structures can be determined at the atomic level by X-ray diffraction and neutron-diffraction studies of crystallized proteins and, more recently, by nuclear magnetic resonance (NMR) spectroscopy of proteins in solution. Recent advances in genome sequencing projects have led to an enormous increase in protein sequence data. However, the number of known protein structures has lagged far behind the number of known sequences.

Protein sequences are encoded in DNA, the holder of genetic information, which is itself a linear molecule composed of four types of bases (monomers which act as the letters of the genetic alphabet). In principle, it should therefore be possible to translate a gene sequence into an amino acid sequence, and to predict the three-dimensional structure of the resulting chain from this amino acid sequence by performing a detailed simulation of the molecular dynamics of protein folding based on the underlying physics and chemistry of interactions among the monomers.
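
The first of these steps, translating a gene sequence into an amino acid sequence, is mechanical. The following is a minimal sketch in Python, assuming the standard genetic code and a coding sequence that is read from its first base; the names `CODON_TABLE` and `translate` are illustrative and do not come from the project's software.

```python
# Standard genetic code, listed in the conventional T, C, A, G base order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to a one-letter amino acid code.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA sequence, codon by codon, up to the first stop codon.

    Assumes the reading frame starts at the first base. Characters other
    than A, C, G, T raise a KeyError; trailing bases that do not fill a
    complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))  # prints "MAK"
```
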
However, searching the entire space of possible conformations for a stable or low-energy three-dimensional structure is computationally infeasible. Consequently, successful approaches to protein structure/function prediction from sequence cannot rely on an explicit search of the conformation space. Instead, the problem, like its counterparts in the realm of complex systems, calls for careful discovery and exploitation of regularities at multiple spatial and temporal scales. Such regularities can be observed at the level of secondary structures (e.g., an alpha helix or a beta fold, each with an elementary unit such as a spiral or a hairpin), the so-called foldons (which are believed to be independently folding units), and highly conserved signature motifs in the sequence that are characteristic of protein families.
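
A back-of-the-envelope calculation suggests why exhaustive search is out of reach. The sketch below is illustrative only: the figure of 3 conformational states per residue and the sampling rate of 10^13 conformations per second are assumptions chosen for the example, not values from this project. It works in logarithms so that long chains do not overflow floating-point arithmetic.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_conformations(n_residues, states_per_residue=3):
    """log10 of the number of conformations, assuming independent residues."""
    return n_residues * math.log10(states_per_residue)


def log10_search_years(n_residues, states_per_residue=3, samples_per_second=1e13):
    """log10 of the years needed to enumerate every conformation once.

    Both default parameters are assumed, illustrative values.
    """
    return (
        log10_conformations(n_residues, states_per_residue)
        - math.log10(samples_per_second)
        - math.log10(SECONDS_PER_YEAR)
    )


for n in (50, 100, 300):
    print(f"{n} residues: about 10^{log10_search_years(n):.0f} years")
```

Under these assumptions, a chain of only 100 residues already has roughly 10^47 conformations, and enumerating them would take on the order of 10^27 years. The count grows exponentially with chain length, so faster hardware does not change the conclusion.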

Against this background, our research is aimed at the design and implementation of algorithms for molecular structure/function prediction from sequence data. We have recently succeeded in using data mining approaches to design sequence classifiers that assign protein sequences to their corresponding functional families. Our approach maps each protein sequence into an N-dimensional vector whose components encode the presence or absence (or other, more sophisticated information measures related to the presence or absence) of sequence motifs. Proteins with known functions are used to generate a training set, which is then used to construct a sequence classifier. New sequences are assigned to functional families based on complex inter-relationships between sequence motifs discovered by a machine learning algorithm. This work builds on recent work on a broad range of data mining and knowledge discovery algorithms (including those drawn from artificial intelligence, statistical pattern recognition, grammatical inference, neural networks, text classification, and evolutionary computing).
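
The mapping from sequences to motif vectors, followed by training and classification, can be sketched as below. This is a toy illustration under stated assumptions, not the classifier used in this project: the motif patterns, the sequences, and the family labels are invented for the example, and a simple nearest-neighbor rule with Hamming distance stands in for the unspecified machine learning algorithm.

```python
import re

# Hypothetical motif patterns (regular expressions), invented for illustration.
MOTIFS = [r"GK[ST]", r"C..C", r"HE..H", r"DD"]

# Hypothetical training sequences with made-up family labels.
TRAINING = [
    ("MAGKSTLLCAAC", "family_A"),
    ("MLGKTAVCPPCW", "family_A"),
    ("MHEAAHLDDKV", "family_B"),
    ("QHELLHSDDRA", "family_B"),
]


def motif_vector(sequence, motifs=MOTIFS):
    """Encode a sequence as an N-dimensional 0/1 vector of motif presence."""
    return [1 if re.search(m, sequence) else 0 for m in motifs]


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def train(labeled_sequences):
    """Build the model: here, simply the stored (vector, family) pairs."""
    return [(motif_vector(seq), family) for seq, family in labeled_sequences]


def classify(sequence, model):
    """Assign the family of the nearest training vector (1-nearest neighbor)."""
    query = motif_vector(sequence)
    _, family = min(model, key=lambda item: hamming(item[0], query))
    return family


model = train(TRAINING)
print(motif_vector("TTGKSPPCLLCQ"))     # prints [1, 1, 0, 0]
print(classify("TTGKSPPCLLCQ", model))  # prints family_A
```

A binary presence/absence encoding is the simplest choice; the vector components could instead hold counts or other information measures, and any classifier that accepts fixed-length vectors could replace the nearest-neighbor rule.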

Funding

Publications

Additional Information

To appear.


Dr. Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
Atanasoff Hall, Ames, IA 50011-1040 USA
phone: +1-515-294-1098, +1-515-294-4377; fax: +1-515-294-0258

© Vasant Honavar, 1999, 2000.