Comparison of Naïve Bayes, Logistic Regression and Markov Model for Protein-DNA Interaction Prediction
By Michelle Ruse and Oksana Yakhnenko
May 3, 2006
CS672 project
·
INTRODUCTION
Will an individual develop a certain disease given the presence or absence of certain protein-DNA interactions? What functions do certain cells perform? Which medicines will interact best with a cell to repair cell damage or eradicate foreign substances within it? The biomedical implications of the study of protein-DNA interaction are numerous.
Given a sequence of amino acids, we want to predict a class for each amino acid in the sequence. Applications of such per-residue class prediction include protein-protein interaction, protein-DNA interaction, and surface residue prediction. Restricting the class to protein-DNA interaction gives a binary classifier with labels {0,1} for {"has protein-DNA interaction","does not have protein-DNA interaction"}, respectively.
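The per-residue formulation can be illustrated with a minimal sketch that turns one labelled sequence into one training example per amino acid. The sliding-window representation, the window half-width, the padding character, and the function name window_examples are all assumptions made for illustration; they are not taken from the project code. The 0/1 labels are passed through unchanged, with the meaning given in the text.

```python
from typing import List, Tuple


def window_examples(
    sequence: str,
    labels: str,
    half_width: int = 4,   # assumed window half-width, not from the project
    pad: str = "-",        # assumed padding symbol for positions past the ends
) -> List[Tuple[str, int]]:
    """Build one (window, label) example per residue.

    `sequence` is a string of one-letter amino acid codes and `labels`
    is a string of '0'/'1' characters of the same length, one per residue.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")

    # Pad both ends so that every residue has a full-width window.
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1

    examples = []
    for i, label in enumerate(labels):
        # The window starting at i in the padded string is centred on residue i.
        window = padded[i:i + width]
        examples.append((window, int(label)))
    return examples


if __name__ == "__main__":
    # Toy sequence and labels, invented purely for illustration.
    for window, label in window_examples("MKVLA", "01100", half_width=1):
        print(window, label)
```

Each resulting pair could then be handed to any of the classifiers compared here; how the window is encoded as features is left open in this sketch.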

·
DATA ( train data | test data )
The data set used for this work is the one used by Yan et al. (see References in paper). The data were extracted from structures of known protein-DNA complexes in the Protein Data Bank (PDB). To narrow the sample to a subset selected by PDB structure quality and/or mutual sequence identity, the data were culled with PISCES (Protein Sequence Culling Server). This process left 171 DNA-binding protein sequences with sequence identity less than or equal to 30 percent and at least 40 amino acids per sequence.
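The length criterion above is simple enough to sketch directly. The snippet below is a hypothetical illustration, not part of the project code: it keeps only sequences with at least 40 amino acids and assumes the sequence identity culling has already been done by PISCES, since that step needs sequence alignment and is not reproduced here. The function name and the toy input are invented.

```python
from typing import Dict

MIN_LENGTH = 40  # minimum number of amino acids per sequence, from the text


def keep_long_enough(
    sequences: Dict[str, str], min_length: int = MIN_LENGTH
) -> Dict[str, str]:
    """Return only the sequences with at least `min_length` residues.

    `sequences` maps a sequence identifier to its amino acid string.
    Identity-based culling is assumed to have been done elsewhere.
    """
    return {
        name: seq for name, seq in sequences.items() if len(seq) >= min_length
    }


if __name__ == "__main__":
    # Toy sequences with invented identifiers, for illustration only.
    toy = {"short": "M" * 39, "exact": "M" * 40, "long": "M" * 55}
    print(sorted(keep_long_enough(toy)))
```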
·
CODE ( code )
The code requires the following packages:
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
AirlDM: http://www.cs.iastate.edu/~dcaragea/AirlDM/index.htm
libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/