Comparison of Naïve Bayes, Logistic Regression and Markov Model for Protein-DNA Interaction Prediction

By

Michelle Ruse and Oksana Yakhnenko

May 3, 2006

CS672 project

 

 

 

 

·       INTRODUCTION

       Will an individual develop a certain disease given the presence or absence of certain protein-DNA interactions? What functions do certain cells perform? Which medicines interact best with a cell to repair cell damage or eradicate foreign substances within the body?  The biomedical implications of studying protein-DNA interaction are numerous.

 

Given a sequence of amino acids, we want to predict a class for each amino acid in the sequence. Applications of such per-residue class prediction include protein-protein interaction, protein-DNA interaction, and surface residue prediction.  Restricting the class to protein-DNA interaction gives a binary classification problem with labels {0,1} for {"has protein-DNA interaction","does not have protein-DNA interaction"}, respectively.
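The per-residue labeling task can be illustrated with a minimal Python sketch. It pairs each amino acid with its class label and represents the residue by a short window of neighboring residues. The window representation, the function name `residue_windows`, the `half_width` parameter, and the `-` padding character are all assumptions made for illustration; they are not taken from the project code, and the label values are simply passed through in whatever {0,1} encoding the data uses.

```python
from typing import List, Tuple


def residue_windows(
    sequence: str,
    labels: str,
    half_width: int = 2,
    pad: str = "-",
) -> List[Tuple[str, int]]:
    """Turn one labeled sequence into (window, label) examples, one per residue.

    sequence   -- amino acid sequence, one letter per residue
    labels     -- string of '0'/'1' characters, one per residue
    half_width -- number of neighbors taken on each side (hypothetical choice)
    pad        -- filler used where the window runs past either end
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")

    # Pad both ends so every residue gets a full-width window.
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1

    examples = []
    for i, label in enumerate(labels):
        window = padded[i : i + width]  # centered on residue i
        examples.append((window, int(label)))
    return examples


if __name__ == "__main__":
    # Toy sequence and labels, invented purely for illustration.
    for window, label in residue_windows("MKVLA", "01100", half_width=1):
        print(window, label)
```

Each resulting (window, label) pair is one training or test instance for a binary classifier; how the windows are encoded as features would depend on the classifier being used.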

 

                      

 

·       DATA   ( train data | test data )

The data set used for this work is the one used by Yan et al. (see References in paper).  Data was extracted from structures of known protein-DNA complexes in the Protein Data Bank (PDB).  To obtain a subset selected by PDB structure quality and mutual sequence identity, the data was culled with PISCES (Protein Sequence Culling Server).  This process left 171 DNA-binding protein sequences with mutual sequence identity of at most 30 percent and at least 40 amino acids per sequence.

     

 

·       PAPER  ( pdf | ps )

 

·       CODE  ( code )

The code requires the following packages:

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

AirlDM: http://www.cs.iastate.edu/~dcaragea/AirlDM/index.htm

libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/