Current Work:
My main research interests are in Bioinformatics, Computational
Biology and Machine Learning. My PhD dissertation focuses on
building a flexible algorithmic framework to predict a wide
range of biological problems from alternative representations
of protein sequences.
The recent completion of many genome projects has generated large
amounts of protein sequence data. One urgent and challenging
task is to identify, label, and classify these proteins into
related groups. Such groupings include functional classes, structural
classes, subcellular localization, and possible interactions
among proteins. However, experimental determination of
these classifications lags far behind the pace of
protein sequencing. Computational methods
that can identify and classify related proteins on a large scale
are urgently needed.
In my PhD studies, I developed machine-learning methods to
identify functional classes, structural classes, and subcellular
localization classes based on alternative representations
of protein sequences. In previous work, I used a data-driven
approach to train decision trees to predict functional
classes of proteins. The decision trees used the presence
or absence of sequence motifs based on a reduced-alphabet
representation of the amino acid sequence. I showed that
decision trees based on reduced alphabets classify these
proteins reliably. In addition, these new representations
provided different but complementary insights into the
protein structure-function relationship that were not
available when using the full 20-letter alphabet.
Later I helped develop a new generalized
version of Naive Bayes called NB(k) that builds a Bayesian
network based on sequence information. It is capable of
handling a network in which each node can have k dependencies.
For a protein sequence, a classifier can be built
on amino acid composition (k=1), composition of amino acid
dimers (k=2), trimers (k=3), or any number of k-dependencies
needed. Previously it was not theoretically sound to use
Naive Bayes with anything beyond amino acid composition,
since doing so seriously violated the independence assumption
of Naive Bayes. NB(k) solved this problem. NB(k) is a fast,
reliable, and easily updateable machine learning approach
that uses no pre-processing of sequence data (e.g. multiple-sequence
alignment). Combining NB(k) output
with a sequence similarity approach such as PSI-BLAST has produced
very reliable classifiers on a wide range of problems including:
protein function, gene ontology labels, gene ontology misclassifications,
secondary structure domains, and subcellular localization.
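The sequence representations mentioned above can be illustrated with
a minimal Python sketch. The six-group reduced alphabet, the group
symbols, and the function names are assumptions chosen for
illustration; they are not the alphabets or code used in the work
described here.

```python
from collections import Counter

# Hypothetical reduced alphabet: each symbol stands for a group of
# amino acids with roughly similar properties (illustrative only).
REDUCED_GROUPS = {
    "h": "AVLIMC",  # hydrophobic / aliphatic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KR",      # positively charged
    "-": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}
TO_REDUCED = {aa: sym for sym, members in REDUCED_GROUPS.items()
              for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids map to '?'.
    """
    return "".join(TO_REDUCED.get(aa, "?") for aa in seq)


def kmer_counts(seq, k):
    """Count overlapping k-mers: k=1 composition, k=2 dimers, k=3 trimers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    seq = "MKVLDE"  # made-up toy sequence
    print(reduce_sequence(seq))
    print(kmer_counts(seq, 2))
```

Either representation, or both combined, can serve as the feature
vector for a sequence classifier.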
In the next few years I will extend this research in the following
directions:
(1) I will extend my previous methods in the context of specific
biological problems. A specific example is to use the NB(k)
method to identify cis and trans regulatory states among all
proline residues found in the PDB. Using the log likelihood ratios
that NB(k) uses for prediction, we can identify with high specificity
candidate sites that are involved in isomerization. These sites
can then be determined experimentally using NMR.
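The idea of scoring candidates by a log likelihood ratio can be
sketched as follows. This is not the NB(k) implementation: it is a
generic scorer that conditions each residue on the k-1 residues before
it, with add-one smoothing, used here as a stand-in. The function
names, the smoothing constant, and the toy training windows are all
assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def train(seqs, k):
    """Count each residue conditioned on the (up to) k-1 residues before it."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s)):
            context = s[max(0, i - k + 1):i]
            counts[context][s[i]] += 1
    return counts


def log_likelihood(seq, counts, k, pseudo=1.0):
    """Smoothed log probability of seq under one class model."""
    total = 0.0
    for i in range(len(seq)):
        context = seq[max(0, i - k + 1):i]
        ctx = counts.get(context, {})
        num = ctx.get(seq[i], 0) + pseudo
        den = sum(ctx.values()) + pseudo * len(ALPHABET)
        total += math.log(num / den)
    return total


def log_likelihood_ratio(seq, pos_counts, neg_counts, k):
    """Positive values favour the positive class, negative the other."""
    return (log_likelihood(seq, pos_counts, k)
            - log_likelihood(seq, neg_counts, k))


def high_confidence(seqs, pos_counts, neg_counts, k, threshold):
    """Keep only sequences whose ratio exceeds a strict threshold."""
    return [s for s in seqs
            if log_likelihood_ratio(s, pos_counts, neg_counts, k) > threshold]


if __name__ == "__main__":
    # Invented toy windows around a proline; not real data.
    pos = train(["WPYWPY", "YPWYPW"], k=2)
    neg = train(["APGAPG", "GPAGPA"], k=2)
    print(log_likelihood_ratio("WPY", pos, neg, k=2))
    print(high_confidence(["WPY", "APG"], pos, neg, k=2, threshold=1.0))
```

Raising the threshold trades recall for specificity, which suits a
setting where each candidate must later be checked by experiment.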
(2) I will extend my previous methods to predict motifs in proteins
without using multiple sequence alignment. I will look at the
estimated probabilities of small conserved regions among a related
set of proteins. By focusing on regions with high probabilities,
large log likelihood ratios, and small variance, it is possible to
narrow the search space enough to search experimentally for
overlapping regions. The overlapping regions become the candidate
motifs. This method will be used to predict signal sites, target
sites, catalytic sites, and other possible active sites in proteins.
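One simple way to turn high-scoring windows into candidate regions is
sketched below. It assumes a per-window score (for example, a log
likelihood ratio) has already been computed for every start position;
the function name, window width, and threshold are hypothetical, and
the variance criterion mentioned above is left out for brevity.

```python
def merge_high_scoring_windows(scores, width, threshold):
    """Merge overlapping or touching windows that score at least threshold.

    scores[i] is the score of the window starting at position i.
    Returns a list of (start, end) regions with end exclusive.
    """
    regions = []
    for start, score in enumerate(scores):
        if score < threshold:
            continue
        end = start + width
        if regions and start <= regions[-1][1]:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Made-up scores for illustration only.
    toy_scores = [0.1, 2.5, 3.0, 0.2, 0.1, 0.0, 0.0, 2.8]
    print(merge_high_scoring_windows(toy_scores, width=3, threshold=2.0))
```

Each merged region is a candidate motif location that would still need
experimental confirmation.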
(3) I will systematically study the relationship among alphabet
size, sequence similarity, the size of the dataset, and the
prediction performance in protein classification problems. As
alphabet size increases, a classifier gains sensitivity at the
cost of selectivity. This cost grows as sequence
similarity decreases. As the amount of available data
shrinks, a highly sensitive classifier becomes less desirable.
The goal is to find the ideal trade-off between sensitivity
and selectivity given the available data. I will use information
theory to determine the minimal information necessary
to build a reliable classifier.
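As one illustration of how information theory can quantify the effect
of alphabet size, the sketch below compares the Shannon entropy per
residue of the same sequence written in a full and in a reduced
alphabet. The three-group mapping and the toy sequence are assumptions
for illustration, and entropy is only one of several measures that
could be used.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per symbol, of the symbols in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical mapping onto a three-symbol alphabet (illustrative only).
TOY_MAP = {"A": "h", "V": "h", "L": "h", "I": "h",
           "K": "+", "R": "+", "D": "-", "E": "-"}

if __name__ == "__main__":
    full = "AVLIKRDE"  # made-up sequence with eight distinct residues
    reduced = "".join(TOY_MAP[aa] for aa in full)
    print(shannon_entropy(full))     # bits per residue, full alphabet
    print(shannon_entropy(reduced))  # bits per residue, reduced alphabet
```

A smaller alphabet carries fewer bits per residue, which is the
quantity to weigh against the amount of training data available.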
(4) I will extend previous computational methods to predict
protein classifications based on taxonomies, directed acyclic
graphs (DAGs), or other structural hierarchies. A growing
number of classification schemes are based on these
structures. Examples include the Gene Ontology project (DAG),
the Structural Classification of Proteins (taxonomy), and EC
classification (taxonomy). Current methods that predict
these classifications must take a subset of the data that
corresponds to a specific level in the hierarchy. I will develop
a fast and flexible method that can predict at
any given level of the hierarchy or along any path in
the hierarchy.
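A basic building block for prediction over such hierarchies is the
expansion of a predicted label to all of its ancestors, so that the
prediction can be read at any level. The sketch below shows this for
a small DAG; the node names and the child-to-parents dictionary are
hypothetical and are not taken from any of the classification schemes
named above.

```python
def ancestors(term, parents):
    """Return every ancestor of term in a DAG given as child -> parents."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_labels(labels, parents):
    """Add all ancestors of each predicted label to the label set."""
    expanded = set(labels)
    for label in labels:
        expanded |= ancestors(label, parents)
    return expanded


if __name__ == "__main__":
    # Toy DAG: D has two parents, B and C, which share the root A.
    toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
    print(sorted(expand_labels({"D"}, toy_parents)))
```

In a taxonomy each node has a single parent, so the same code returns
the unique path from a label to the root.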
Identification and classification of proteins into meaningful
categories is a very important problem in biology. The research
in computational methods to predict such categories in a quick
and reliable fashion is still in its initial stage. This biological
problem provides challenging targets for machine-learning research
to learn models for classification, causal modeling, and regression.
Solving this problem requires contributions from both biologists
and computer scientists. With education and research experience
in both biology and computer science, I am well prepared for
a productive faculty career at the intersection
of life science and computer science. I will apply my knowledge
to this problem through collaborations with researchers from
both the computational and biological communities.
Useful Links:
NCBI page
NCBI FTP page
PDB page