Iowa State University

Carson M. Andorf
Artificial Intelligence Research Laboratory

Department of Computer Science

My Research


Current Work:

My main research interests are in bioinformatics, computational biology, and machine learning. My PhD dissertation focuses on building a flexible algorithmic framework for a wide range of biological prediction problems, based on alternative representations of protein sequences.


The recent completion of many genome projects has generated large amounts of protein sequence data. One urgent and challenging task is to identify, label, and classify these proteins into related groups. These groups include functional classes, structural classes, subcellular localization classes, and sets of proteins that may interact with one another. However, assigning these classifications through biological experiments lags far behind the accelerating pace of protein sequencing. Computational methods that can identify and classify related proteins on a large scale are urgently needed.


In my PhD studies, I developed machine-learning methods to identify functional classes, structural classes, and subcellular localization classes based on alternative representations of protein sequences. In earlier work, I used a data-driven approach to train decision trees to predict the functional classes of proteins. The decision trees used the presence or absence of sequence motifs defined over reduced-alphabet representations of the amino acid sequence. I showed that decision trees based on reduced alphabets classify these proteins reliably. In addition, the new representations provided different but complementary insights into the protein structure-function relationship that were not available when using the full 20-letter alphabet. Later I helped develop a generalized version of Naive Bayes called NB(k), which builds a Bayesian network from sequence information and can handle a network in which each node has k dependencies. Applied to a protein sequence, a classifier can be built on amino acid composition (k=1), on the composition of amino acid dimers (k=2) or trimers (k=3), or on any higher order of dependency needed. Previously it was not theoretically sound to use Naive Bayes with anything beyond amino acid composition, since doing so seriously violated the independence assumption of Naive Bayes. NB(k) solved this problem. NB(k) is a fast, reliable, and easily updateable machine-learning approach that requires no pre-processing of sequence data (e.g., multiple sequence alignment). Combining NB(k) output with a sequence similarity approach such as PSI-BLAST has produced very reliable classifiers on a wide range of problems, including protein function, gene ontology labels, gene ontology misclassifications, secondary structure domains, and subcellular localization.
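
As a rough illustration of the representations described above, the following Python sketch maps a protein sequence onto a reduced alphabet and counts k-mers (k=1 gives composition, k=2 dimers, k=3 trimers). The six-group alphabet, the group symbols, and the function names are assumptions made for this example only; they are not the groupings or the code used in the work described here.

```python
from collections import Counter

# Hypothetical reduced alphabet: an illustrative grouping by rough
# physicochemical class, NOT the grouping used in the original work.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # hydrophobic
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KRH",     # positively charged
    "-": "DE",      # negatively charged
    "s": "GP",      # special backbone conformations
}

# Invert the table: one reduced symbol per standard amino acid.
TO_REDUCED = {aa: sym for sym, aas in REDUCED_GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Rewrite a sequence of the 20 standard residues in the reduced alphabet."""
    return "".join(TO_REDUCED[aa] for aa in seq.upper())


def kmer_counts(seq, k):
    """Count overlapping k-mers: k=1 is composition, k=2 dimers, k=3 trimers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def has_motif(seq, motif):
    """Presence/absence feature of a reduced-alphabet motif in a sequence."""
    return motif in reduce_sequence(seq)
```

A presence/absence feature such as `has_motif(seq, "h+h")` is the kind of binary attribute a decision tree could split on, while `kmer_counts(reduce_sequence(seq), 2)` gives dimer counts over the reduced alphabet.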


In the next few years I will extend this research in the following directions:


(1) I will extend my previous methods in the context of specific biological problems. A specific example is to use the NB(k) method to identify cis and trans regulatory states among all proline residues found in the PDB. Using the log likelihood ratios that NB(k) computes for prediction, we can identify, with high specificity, candidate residues that are involved in isomerization. These sites can then be verified experimentally using NMR.
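
The log likelihood ratio idea can be sketched as follows. This is a minimal sketch and not the NB(k) implementation: it assumes that k dependencies can be approximated by an order k-1 Markov chain over residues, it uses add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_markov(seqs, k, pseudo=1.0):
    """Estimate log P(residue | previous k-1 residues) from training sequences.

    Assumption: an order k-1 Markov chain stands in for "k dependencies".
    Returns a function log_prob(context, residue).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(len(s)):
            ctx = s[max(0, i - (k - 1)):i]  # shorter context near the start
            counts[ctx][s[i]] += 1

    def log_prob(ctx, aa):
        c = counts.get(ctx, {})
        total = sum(c.values()) + pseudo * len(ALPHABET)
        return math.log((c.get(aa, 0.0) + pseudo) / total)

    return log_prob


def log_likelihood(seq, log_prob, k):
    """Sum of conditional log probabilities along the sequence."""
    return sum(
        log_prob(seq[max(0, i - (k - 1)):i], seq[i]) for i in range(len(seq))
    )


def log_likelihood_ratio(seq, lp_pos, lp_neg, k):
    """Positive values favor the positive class, negative values the other."""
    return log_likelihood(seq, lp_pos, k) - log_likelihood(seq, lp_neg, k)
```

In a setting like the one described above, the two models would be trained on sequence windows around residues of each class, and only windows whose ratio exceeds a strict threshold would be kept as candidates for experimental follow-up.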


(2) I will extend my previous methods to predict motifs in proteins without using multiple sequence alignment. I will look at the estimated probabilities of small conserved regions among a related set of proteins. By focusing on regions with high probabilities, large log likelihood ratios, and small variance, it is possible to narrow the search space enough to search experimentally for overlapping regions. The overlapping regions become the candidate motifs. This method will be used to predict signal sites, target sites, catalytic sites, and other possible active sites in proteins.


(3) I will systematically study the relationship among alphabet size, sequence similarity, dataset size, and prediction performance in protein classification problems. As alphabet size increases, a classifier gains sensitivity at the cost of selectivity. This cost is amplified as sequence similarity decreases. As less data becomes available, a highly sensitive classifier becomes less desirable. The goal is to find the ideal trade-off between sensitivity and selectivity given the available data. I will use information theory to determine the minimal information necessary to build a reliable classifier.
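
One information-theoretic quantity that could serve as a starting point is the mutual information between a sequence feature and the class label. The sketch below is only an illustration under that assumption; the choice of mutual information, the function name, and the plug-in (empirical frequency) estimator are not taken from the work described here.

```python
import math
from collections import Counter


def mutual_information(features, labels):
    """Empirical mutual information, in bits, between two discrete variables.

    features and labels are equal-length sequences of hashable values,
    e.g. motif presence/absence flags and class labels.
    """
    n = len(features)
    px = Counter(features)
    py = Counter(labels)
    pxy = Counter(zip(features, labels))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )
```

Computing this value for the same feature under alphabets of different sizes gives one rough way to compare how much class information each representation retains on a given dataset. The plug-in estimate is biased upward on small samples, so comparisons on small datasets should be read with care.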


(4) I will extend previous computational methods to predict protein classifications based on taxonomies, directed acyclic graphs (DAGs), or other structural hierarchies. A growing number of classification schemes are based on these structures. Examples include the Gene Ontology project (DAG), the Structural Classification of Proteins (taxonomy), and the EC classification (taxonomy). Current prediction methods for these classifications must take a subset of the data that corresponds to a specific level in the hierarchy. I will provide a quick and flexible method that can predict at any given level of the hierarchy, or predict any path through the hierarchy.
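
To make the hierarchy setting concrete, the sketch below shows one simple way to keep predictions consistent across levels: a score assigned to a term is propagated to all of its ancestors, so a confident prediction deep in the hierarchy also supports every more general term above it. This is an assumed post-processing step for illustration, not the method proposed above; the dictionary layout and function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links from `term`.

    `parents` maps each term to a list of its direct parents. A tree has at
    most one parent per term; a DAG may have several.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_scores(scores, parents):
    """Raise each ancestor's score to at least that of its best descendant."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


def predicted_path(scores, parents, threshold):
    """Terms whose propagated score meets the threshold."""
    propagated = propagate_scores(scores, parents)
    return {t for t, s in propagated.items() if s >= threshold}
```

With a toy hierarchy such as `{"leaf": ["mid_a", "mid_b"], "mid_a": ["root"], "mid_b": ["root"]}`, a high score for `leaf` yields predictions for `mid_a`, `mid_b`, and `root` as well, which gives a prediction at every level rather than at one fixed level.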


Identifying and classifying proteins into meaningful categories is a very important problem in biology. Research on computational methods that predict such categories quickly and reliably is still at an early stage. This biological problem provides challenging targets for machine-learning research on models for classification, causal modeling, and regression. Solving it requires contributions from both biologists and computer scientists. With education and research experience in both biology and computer science, I am well prepared for a productive faculty career at the intersection of the life sciences and computer science. I will pursue this problem through collaborations with researchers from both the computational and biological communities.

Useful Links:
NCBI page
NCBI FTP page
PDB page