#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass article \language english \inputencoding auto \fontscheme times \graphics default \paperfontsize 10 \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default \layout Title Inter-element dependency models for sequence classification \layout Author Adrian Silvescu, Carson Andorf, Drena Dobbs, Vasant Honavar \layout Date \SpecialChar ~ \layout Standard Artificial Intelligence Laboratory \layout Standard Department of Computer Science and \layout Standard Graduate Program in Bioinformatics and Computational Biology \layout Standard Iowa State University \layout Standard Ames, IA 50011, USA \layout Standard {silvescu|andorfc|ddobs|honavar}@iastate.edu \layout Abstract Naive Bayes is a fast to train model for sequence classification. We develop and experiment with two methods that are equally fast (they require only one pass through the training data), but they are also able to models interactions among close neighbours in the sequence (unlike the Naive bayes independence assumption). The first method basically runs Naive Bayes on overlapping k-grams obtained from the sequence. The second method, called NB(k), constructs a classifier associated with an undirected graphical model over the sequence that takes into account k-wise dependencies. We test our algorithms on protein function classification tasks based on functional families from the Gene Ontology (GO) database. Our results show significant improvements of the proposed methods over Naive Bayes with NB(k) leading over NB k-grams in all test cases. The two proposed methods despite having improved modelling accuracy and demonstrated improved test accuracy maintain the \begin_inset Quotes eld \end_inset one pass through the data only \begin_inset Quotes erd \end_inset training property of Naive Bayes thus yielding fairly accurate and efficient classifiers. \layout Section Introduction \layout Standard Sequence classification is a task that appears in many natural scenarios such as protein sequence function prediction, intrusion detection, text classification, etc. \layout Standard A simple solution that offers a quick and practical approach to this problem is Naive Bayes. Naive Bayes assumes that the elements in the sequence are independent of each other given the class. This independence assumption however seldom holds in practice. Furthermore most of the time the dependencies are very strong among close neighbours in the sequence. In order to address this concern we would like to be able to relax the strong independence assumption of Naive Bayes and allow for a limited amount of interaction among close neighbours, and thus get better modelling accuracy. Despite the additional sophistication we would like to maintain the \begin_inset Quotes eld \end_inset one pass through the data only \begin_inset Quotes erd \end_inset training property of Naive Bayes. Note that since we are preoccupied with sequence classification whenever we talk about Naive Bayes we actually refer to the so called Naive Bayes Multinomial which is more appropriate for our case (sequence classification) as opposed to the Multivariate model (see \begin_inset LatexCommand \cite{mccallum98comparison} \end_inset for a discussion about it). \layout Standard We are going to provide two methods to deal with this problem. The first one basically splits the sequence into a sequence of overlapping k-grams and runs a Naive Bayes algorithm in the transformed space. We call the resulting algorithm NB k-grams. This method however violates the assumption of independence that Naive Bayes holds in an obvious way: Since the k-grams are overlapping, every two subsequent k-grams share k-1 elements, which makes them evidently non-indep endent. \layout Standard In the second method, in order to deal with the above mentioned problem properly we construct an undirected graphical model for the k-grams \begin_inset LatexCommand \cite{charniak93statistical} \end_inset , which will adequately discount the contributions of the overlaps. We show the equivalence of this undirected model with the Markov Model of order k-1. This contributes to dispelling a misconception that undirected models (i.e., random fields) are more suitable than directed models such as Markov Models for modelling protein sequences. This view may be held based on the fact that the order in the protein sequence has a spatial nature, which is better modelled by undirected models (random fields), rather than a temporal nature as it is the case with Markov Models. While we do not dispute the spatial nature of the oreder in protein sequences, we show that both the undirected (spatial) and the directed models (temporal = Markov Models) associated with with k-grams yield the same probability distribution. \layout Standard In order to use our our undirected models for classification, we apply the basic trick of turning a generative model into a predictor: We train one generative model per class and then when asked to predict a class for a newly seen sequence we output the one whose corresponding generative model outputs the highest probability for the given sequence, weighted with the probability of the class. We call this model NB(k) from Naive Bayes order k (which is when we model dependencies among k attributes). Note that Naive Bayes will turn out to be NB(1) (we model dependencies among 1 attribute, that is no dependencies, i.e., the independence) \layout Standard We put to test our methods on three protein function prediction problems with functional classes extracted from GO (the Gene Ontology \begin_inset LatexCommand \cite{gene00gene} \end_inset ) and corresponding sequences extracted from SWISSPROT \begin_inset LatexCommand \cite{bairoch00swiss-prot} \end_inset . In our experiments NB k-grams outperformed Naive Bayes on all the tests and NB(k) outperformed NB k-grams on all tests too. \layout Standard To recapitulate, we will present two methods that relax the independence assumption of Naive Bayes by modelling interaction among close neighbours in the sequence. The two methods preserve the \begin_inset Quotes eld \end_inset one pass through the data \begin_inset Quotes erd \end_inset training property of the Naive Bayes and they are also also easily update-able. Additionally our second method - NB(k) which is not incidentally, also our best performer, represents and end in itself since it is the most complete undirected graphical model for k-grams modelling. We further explore the equivalence of this undirected model to the (directed) Markov Model of order k-1 thus clearing a misconception about the presumed superiority of undirected graphical models vs. the directed ones in terms of modelling sequences where the order is spatial as opposed to temporal. \layout Standard The rest of the paper will proceed as follows: First, in the next section we will have an in depth exploration of the intuition and theory behind the two methods to be explored. Then, we put them to test in section three and produce some experimental results. Finally, in section four we conclude with a summary and discussion. \layout Section Method \layout Standard In this section we will preoccupy ourselves with unraveling the details of the new models. As a basic strategy we will first start from basic principles and try to develop some intuition behind the new methods followed by more precise and general formulations, which will be furthermore supplemented by theoretical results. \layout Standard We start in the first subsection with an outline of the Naive Bayes Classifier and the general method for producing classifiers from generative models. Then we examine a way to produce undirected probabilistic models that take into account correlations among proximal elements within a sequence. Subsequently, we outline the associated directed graphical models and comment on the relation with the directed ones. We then briefly present the parameter fitting method. \layout Subsection Naive Bayes and the General Method for producing classifiers from generative models. \layout Standard Before plunging into the details, a little bit of \series bold Notation \series default : We will denote a whole sequence with \begin_inset Formula $\overline{S}$ \end_inset , the i-th element of the sequence by \begin_inset Formula $S_{i}$ \end_inset and the value of the i-th element in the sequence by \begin_inset Formula $s_{i}$ \end_inset , furthermore let \begin_inset Formula $V$ \end_inset be the set of all possible values that the elements in the sequence can take. Let \begin_inset Formula C \end_inset be the class variable and let \begin_inset Formula $\{ c_{j}\}_{j=1,m}$ \end_inset be the possible values that C can take. And let \begin_inset Formula $n$ \end_inset be the length of the current sequence i.e., \begin_inset Formula $\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}]$ \end_inset . \layout Standard The Naive Bayes algorithm for sequence classification assumes that the successiv e elements in a given sequence are independent of each other given the class. \layout Standard One way to interpret Naive Bayes is as a generative model. A generative model produces the observed data by the means of a probabilistic generation process. The generation process for Naive Bayes is very simple: \layout Standard \begin_inset Float algorithm placement H wide false collapsed false \layout Standard 1. First generate a class \begin_inset Formula $c_{j}$ \end_inset according to the probability P(C) \layout Standard 2. Then for i = 1 to n \layout Standard \begin_inset Formula $\qquad$ \end_inset generate \begin_inset Formula $s_{i}$ \end_inset according to P( \begin_inset Formula $S_{i}$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ). \layout Standard \SpecialChar ~ \layout Standard Note: Since we also assume stationarity P( \begin_inset Formula $S_{i}$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) = P( \begin_inset Formula $S_{l}$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) := P( \begin_inset Formula $S$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) for every \begin_inset Formula $i,l\in\{1,...,n\}$ \end_inset and \begin_inset Formula $c_{j}\in C$ \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Naive-Bayes-Sequence} \end_inset Naive Bayes Sequence Generative Model \end_inset \layout Standard Note that in step 2 we are generating each element in the sequence independent of the previous ones or the ones that are going to come. The generative model implemented by the Algorithm \begin_inset LatexCommand \ref{cap:Naive-Bayes-Sequence} \end_inset can be depicted as a Bayesian Network as shown in Figure \begin_inset LatexCommand \ref{cap:Naive-Bayes-Bayesian} \end_inset . In this figure the arrows from the class variable to the elements of the sequence show the dependencies in the generative process. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename NB.eps \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Naive-Bayes-Bayesian} \end_inset Naive Bayes Bayesian Network \end_inset \layout Standard The Naive Bayes classifier is a particular example of a general scheme of producing a classifier for a given generative model for the input data (sequence in our case). We will try to outline it in what follows with the purpose of using it with other more complicated generative models. \layout Standard Let us assume we have a probabilistic generative model for sequences \begin_inset Formula $GM$ \end_inset . That is a probabilistic model which given a sequence \begin_inset Formula $\overline{S}=s_{1},...,s_{n}$ \end_inset can give us the probability \begin_inset Formula $P_{GM}(\overline{S}=s_{1},...,s_{n})$ \end_inset according to the generative model \series bold \emph on \begin_inset Formula $GM$ \end_inset \series default \emph default . Then if we want to obtain a classifier using the generative model \begin_inset Formula $GM$ \end_inset we can use the following (standard) procedure: \layout Standard 1. For each class \begin_inset Formula $c_{j}$ \end_inset train a generative model \begin_inset Formula $GM(c_{j})$ \end_inset using the data from class \begin_inset Formula $c_{j}$ \end_inset . \layout Standard 2. Then when asked to classify a new sequence \begin_inset Formula $\overline{S}=s_{1},...,s_{n}$ \end_inset predict it as belonging to class: \layout Standard \begin_inset Formula \[ classification=\arg\max_{c_{j}\in C}\: P_{GM(c_{j})}(\overline{S}=s_{1},...,s_{n})\: P(c_{j})\] \end_inset \layout Standard or equivalently: \layout Standard \begin_inset Formula \[ classification=\arg\max_{c_{j}\in C}\: P(\overline{S}=s_{1},...,s_{n}|GM(c_{j}))\: P(c_{j})\] \end_inset \layout Standard or equivalently if, by abuse of notation, \begin_inset Formula $P(\overline{S}=s_{1},...,s_{n}|c_{j})$ \end_inset stands for \begin_inset Formula $P(\overline{S}=s_{1},...,s_{n}|GM(c_{j}))$ \end_inset we will have: \layout Standard \begin_inset Formula \[ classification=\arg\max_{c_{j}\in C}\: P(\overline{S}=s_{1},...,s_{n}|c_{j})\: P(c_{j})\] \end_inset \layout Standard which in the case of the Naive Bayes generative model is the well known: \layout Standard \begin_inset Formula \[ classification=\arg\max_{c_{j}\in C}\: P(S_{i}=s_{i}|c_{j})\: P(c_{j})\] \end_inset \layout Standard because under the independence assumption given the class we have: \begin_inset Formula \[ P(\overline{S}=s_{1},...,s_{n}|c_{j})=\prod_{i=1}^{n}P(S_{i}=s_{i}|c_{j})\] \end_inset \layout Standard Note that in the case of Naive Bayes the generative model for the input data is assumed to be a set of independent samples drawn from the same multinomial distribution \begin_inset Formula $P(S_{i}=s_{i}|c_{j})$ \end_inset (so the joint \begin_inset Formula $P(S=s_{1},...,s_{n}|c_{j})$ \end_inset can be obtained by taking the product). Furthermore under the stationarity assumption the probability distribution does not depends on the position \begin_inset Formula $i$ \end_inset (i.e., P( \begin_inset Formula $S_{i}$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) = P( \begin_inset Formula $S_{l}$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) = P( \begin_inset Formula $S$ \end_inset | \begin_inset Formula $C=c_{j}$ \end_inset ) for every \begin_inset Formula $i,l\in\{1,...,n\}$ \end_inset and \begin_inset Formula $c_{j}\in C$ \end_inset ). \layout Standard We are interested however in capturing more interactions among the elements in the sequence, as they are seldom independent and as such, we will need a more complicated model for the data. Given that this model is established, we can then produce a classifier using the scheme outlined above. So from now on we will forget about classes and we will only be concerned with finding probabilistic models that take into account dependencies among elements in the sequence. And then whenever we shall be asked produce classifications with our derived a probabilistic model \begin_inset Formula $PM$ \end_inset we will apply the schema outlined in this section by replacing in the formulas \begin_inset Formula $P_{GM(c_{j})}(...)$ \end_inset with \begin_inset Formula $P_{PM(c_{j})}(...)$ \end_inset . \layout Standard Now that we have outlined the general scheme for producing classifiers from Probabilistic Models we reduced our problem to the one of finding some appropriate probabilistic models that capture correlations among close elements within a sequence. Upon finding such models we can turn them into a classifier using the scheme outlined above. Hence, classification being set aside, we are ready to launch into the exploration of dependency models among sequence elements. \layout Subsection Models for direct interaction among k consecutive elements \layout Standard In graphical depiction we show the interaction model for Naive Bayes in Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset a) for a sequence of five elements. What we would like now to model is dependencies among two or more attributes. One way to represent these dependencies in a graphical form is by drawing edges between the nodes that are supposed to be directly dependent on each other. The graph for pairwise dependencies is illustrated in Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset b) and the one for 3-wise dependency is depicted in Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset c). \layout Standard \begin_inset Float figure placement t wide false collapsed false \layout Standard \begin_inset Graphics filename sNBk.eps \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Graphical-depiction-of} \end_inset Graphical depiction of the dependence between the elements in a sequence of five elements using Undirected Graphical Models: a) shows the Naive Bayes [= independence model]; b) shows pairwise dependence and c) shows 3-wise dependence. \end_inset \layout Standard The models show in Fugure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset are called Probabilistic Undirected Graphical Models / Markov Networks / Random Fields. The arcs in the graphs denote the direct dependencies. By complementarity the absence of an arc between two nodes denotes the absence of a direct dependency. This does not necessarily mean independence, and most often than not, it isn't independence. Two nodes are going to be indirectly dependent if they are going to be connected in the graph by a path containing more than one edge. The absence of an arc between two nodes means that there may be situations where we if we know some of the values of nodes on paths connecting our two nodes, this might yield the two nodes independent. More exactly in order to get independence between two indirectly connected nodes (connected by a path of length grater that one) we have to know the value of at least one node on each of the paths between these two nodes. The knowledge of the values of these nodes is going to \begin_inset Quotes eld \end_inset block \begin_inset Quotes erd \end_inset the indirect dependence between our two unconnected nodes, thus yielding them independent. \layout Standard Now that we have explained the intuition behind the graphical representation we are going to concern ourselves with a way of capitalising on this intuition and try to derive a formula for the probability of a data-point (sequence in our case). That will allow us to provide \begin_inset Formula $P_{PM}(\overline{S})$ \end_inset which is all that is needed in order for our classifcation scheme to work. In the independence assuming case ( Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset a) ) the formula for our 5 elements of the sequence is as follows: \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\prod_{i=1}^{5}P(S_{i}=s_{i})\] \end_inset \layout Standard and in general for a sequence of \begin_inset Formula $n$ \end_inset elements we have: \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\prod_{i=1}^{n}P(S_{i}=s_{i})\] \end_inset \layout Subsubsection \begin_inset LatexCommand \label{sub:UNPkgP-&-NB} \end_inset UNPkgP & NB k-grams \layout Standard One intuitive way to try to define a probability distribution for the case of the graphical model representing pairwise dependencies ( Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset b) ) is to take the product of the marginals of the directly dependent node as proposed by the following formula ( for our 5 elements sequence from Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset b) ) [Since the nodes are pairwise dependent the information in the pairwise marginals is presumably important and we cannot derive it from products of single variable marginals]: \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])\;?=\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\] \end_inset \layout Standard This will not produce a probability distribution however, and the intuition behind it is that we are somehow \begin_inset Quotes eld \end_inset double counting \begin_inset Quotes erd \end_inset or more exactly \begin_inset Quotes eld \end_inset double multiplying \begin_inset Quotes erd \end_inset the influence of some of the nodes (such as \begin_inset Formula $S_{2}=s_{2},S_{3}=s_{3},S_{4}=s_{4}$ \end_inset ; \begin_inset Formula $S_{2}=s_{2}$ \end_inset for example appears both in \begin_inset Formula $P(S_{1}=s_{1},S_{2}=s_{2})$ \end_inset and \begin_inset Formula $P(S_{2}=s_{2},S_{3}=s_{3})$ \end_inset ). \layout Standard The are two ways to deal with this problem. The first one is to ignore the double counting and in order to get a probabilit y distribution just normalise it over all the possible values that the elements in the sequence can take. So in general we will have \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\frac{1}{Z}\prod_{i=1}^{n-1}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\] \end_inset \layout Standard Where \begin_inset Formula $Z$ \end_inset is given by the following formula: \layout Standard \begin_inset Formula \[ Z=\sum_{S=(s_{1},...,s_{n})\in V^{n}}\prod_{i=1}^{n-1}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\] \end_inset \layout Standard We will call this model Probabilistic 2-gram Propositionalisation or shortly P2gP. More generally, for the case of k-wise direct dependencies we will call this model Probabilistic k-gram Propositionalisation or shortly PkgP. \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\frac{1}{Z}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})\] \end_inset \layout Standard and \begin_inset Formula $Z$ \end_inset being given by: \layout Standard \begin_inset Formula \[ Z=\sum_{S=s_{1},...,s_{n}\in V^{n}}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})\] \end_inset \layout Standard Note that the normalisation constant \begin_inset Formula $Z$ \end_inset (a.k.a. partition function) is a very complicated thing to calculate and with the exception of the cases when we have very small sequences we cannot compute it. One possible way to deal with this problem is to just ignore it, or basically set \begin_inset Formula $Z$ \end_inset to1 and use the corresponding results (which are not going to be true probabili ties) instead of the true probability \begin_inset Formula $p(\overline{S})$ \end_inset when doing classification. We will call this model UnNormalised Probabilistic k-gram Propositionalisation or in short UNPkgP. This model turns out, from the point of view of classification to be the same as transforming the sequence into a bag of overlapping k-grams and using the Naive Bayes classifier on it. Here is a more precise statement of the claim and the prove for it. \begin_inset Note collapsed true \layout Standard (for a proof claim see \begin_inset LatexCommand \cite{silvescu04inter-element} \end_inset ). \end_inset \layout Standard \series bold Claim: \series default \emph on The class conditional (classifier) version of UNPkgP [denoted C-UNPkgP] yields the the same decision as Naive Bayes on k-grams [denoted NB k-grams]. \layout Standard \emph on Proof. \layout Standard \begin_inset Formula \[ C-UNPkgP=\arg\max_{c_{j}\in C}P_{C-UNPkgP(c_{j})}(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])P(c_{j})\] \end_inset \begin_inset Formula \begin{equation} =\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P_{C-UNPkgP(c_{j})}(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})P(c_{j})\label{eq:claim1}\end{equation} \end_inset \layout Standard \begin_inset Formula \begin{equation} =\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})P(c_{j})\label{eq:claim2}\end{equation} \end_inset \layout Standard \begin_inset Formula \[ =\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})P(c_{j})=NB\: k-grams\] \end_inset Where the transition from equation ( \begin_inset LatexCommand \ref{eq:claim1} \end_inset ) to equation ( \begin_inset LatexCommand \ref{eq:claim2} \end_inset ) is due to the fact that \begin_inset Formula $P_{C-UNPkgP(c_{j})}(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})$ \end_inset is estimated in the same way as \begin_inset Formula $P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})$ \end_inset from the training data , namely as: \layout Standard \begin_inset Formula \[ P(s_{1},...,s_{k}|c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(c_{j})}\] \end_inset \layout Standard (see Section \begin_inset LatexCommand \ref{sub:Training} \end_inset Training ... for more details) \begin_inset Formula $\Box$ \end_inset \layout Standard As a consequence of this claim we are going to us Naive Bayes applied to k-grams as our first proposes method which we shoed to be identical in classification results to turning or UNPkgP probabilistic model into a classifier. \layout Subsubsection \begin_inset LatexCommand \label{sub:NB(k)} \end_inset NB(k) \layout Standard In a different vein however we can try to fix the problem of \begin_inset Quotes eld \end_inset double multiplying \begin_inset Quotes erd \end_inset by dividing by the probabilities associated with the double-counted nodes. That is, for our 5 letter sequence from Figure \begin_inset LatexCommand \ref{cap:Graphical-depiction-of} \end_inset b): \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])\;?=\frac{\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})}{\prod_{i=2}^{4}P(S_{i}=s_{i})}\] \end_inset \layout Standard It turns out that this is the right thing to do and what we get is a full blown probability distribution which is the \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset one. The generalisation for \begin_inset Formula $k>2$ \end_inset is as follows: \layout Standard \begin_inset Formula \begin{equation} P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])\;=\frac{\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})}{\prod_{i=2}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-2}=s_{i+k-2})}\label{eq:nbkp}\end{equation} \end_inset \layout Standard The obtained probability model is \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset in the following sense: It is the same probability model that would be obtained from considering a Markov Network / Random Field over the sequence, whose graph is constructed by drawing direct arcs between all k consecutive elements of the sequence. Additionally the parameters for the marginal probabilities of each k consecutiv e elements (the maximal cliques of these graphs) are shared for all positions. We are producing in this case a position independent model (or equivalently the probability distribution associated with it is stationary). More exactly if we model dependence among k attributes we will call the model NB(k) (from Naive Bayes of k). Note that in this case NB(1) is really the independence case which yields Naive Bayes. \layout Standard The fact that the probability distribution obtained in the previous formula is correct is supported by the Junction Tree Theorem (see \begin_inset LatexCommand \cite{cowell99probabilistic} \end_inset ): \layout Standard \series bold Theorem \series default \emph on (Junction Tree): Let the Markov Network for a set of random variables be described by a chordal graph G, then the probability distribution associated with the Markov network can be written as the product of the marginals of the maximal cliques divided by the product of the marginals of the separator s. i.e., \layout Standard \begin_inset Formula \[ P(\overline{X}=[X_{1}=v_{1},...,X=v_{n}])\;=\frac{\prod_{c\in MaxCliques}P(X_{c}=x_{c})}{\prod_{s\in Separators}P(X_{s}=x_{s})}\] \end_inset \layout Standard Before proceeding with a proof sketch a few potentially useful \series bold Definitions: \layout Standard Definition: ( \series bold \emph on elementary cycle \series default \emph default ) A cycle is called elementary if it contaitn no sub-cycles. \layout Standard Definition:( \series bold \emph on chordal graph \series default \emph default ) A graph G is chordal iff it has no elementary cycles of length bigger that 3. \layout Standard Definition: \series bold \emph on (clique \series default \emph default ) A clique is a set of nodes from a graph G such that the is an edge between every pair of nodes in the clique. \layout Standard Definition:( \series bold \emph on maximal clique \series default \emph default ): A clique c from a graph G is called maximal clique if any superset of nodes that contains the cliques c is not a clique. \layout Standard Definition:( \series bold \emph on separator \series default \emph default ) Let \begin_inset Formula $c_{1}$ \end_inset and \begin_inset Formula $c_{2}$ \end_inset two cliques from a graph G. The separator of these two cliques \begin_inset Formula $c_{1}$ \end_inset and \begin_inset Formula $c_{2}$ \end_inset is \begin_inset Formula $c_{1}\cap c_{2}$ \end_inset . \layout Standard \series bold Remark \series default : It is obvious that the sequence graphs that we are dealing with are chordal and the application of the theorem will give us our previously derived formula i.e., equation ( \begin_inset LatexCommand \ref{eq:nbkp} \end_inset ). \layout Standard \emph on Proof Idea \emph default . We will give a proof of the theorem for the sequence graphs that we are dealing with and refer the reader to \begin_inset LatexCommand \cite{cowell99probabilistic} \end_inset for the proof in the case of general chordal graphs. \layout Standard Intuitive example: For our sequence of 5 elements the probability of a sequence is as follows: \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\frac{\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})}{\prod_{i=2}^{4}P(S_{i}=s_{i})}\] \end_inset \layout Standard \begin_inset Formula \[ =\frac{P(S_{1}=s_{1},S_{2}=s_{2})P(S_{2}=s_{2},S_{3}=s_{3})P(S_{3}=s_{3},S_{4}=s_{4})P(S_{4}=s_{4},S_{5}=s_{5})}{P(S_{2}=s_{2})P(S_{3}=s_{3})P(S_{4}=s_{4})}\] \end_inset \layout Standard where if we write \begin_inset Formula $P(S_{1}=s_{1},S_{2}=s_{2})=P(S_{1}=s_{1})P(S_{2}=s_{2}|S_{1}=s_{1})$ \end_inset and if we also group together the i+1-th term from the numerator with the i-th term from the denominator into a conditional. For example for the 2-nd term from numerator and the 1-st term from the denominator we have: \begin_inset Formula $P(S_{2}=s_{2},S_{3}=s_{3})/P(S_{2}=s_{2})=P(S_{3}=s_{3}|S_{2}=s_{2})$ \end_inset we get: \layout Standard \begin_inset Formula \[ P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\] \end_inset \layout Standard \begin_inset Formula \[ P(S_{1}=s_{1})P(S_{2}=s_{2}|S_{1}=s_{1})P(S_{3}=s_{3}|S_{2}=s_{2})P(S_{4}=s_{4}|S_{3}=s_{3})P(S_{5}=s_{5}|S_{4}=s_{4})\] \end_inset \layout Standard which is \begin_inset Formula $P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])$ \end_inset given the independences represented by the graphical model. \layout Standard Similarly we can show that for k-wise dependencies: \layout Standard \begin_inset Formula \begin{equation} P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])\;=\frac{\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})}{\prod_{i=2}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-2}=s_{i+k-2})}\label{eq:pnbk}\end{equation} \end_inset \layout Standard \begin_inset Formula \begin{equation} =P(S_{1}=s_{1},...,S_{k-1}=s_{k-1})\prod_{i=k}^{n}P(S_{i}=s_{i}|S_{i-1}=s_{i-1},...,S_{i-k+1}=s_{i-k+1})\label{eq:pmmk}\end{equation} \end_inset \layout Standard Which proves the theorem for sequence graphs. \begin_inset Formula $\Box$ \end_inset \layout Standard Note that the previous formula is identical to the probability that a Markov(k-1 ) model will assign. Which brings us to the next Section. \layout Subsubsection Directed Graphical Model Interpretation of NB(k) and the Random Fields vs. Markov models debate. \layout Standard If we are to construct a directed graphical model instead of an undirected one we will get, for our sequence of length 5, the graphs depicted in Figure \begin_inset LatexCommand \ref{cap:Bayesian-Network-(Markov} \end_inset a) b) c) for independence, pairwise and 3-wise dependency respectively. These are graphs representing the Markov(k-1) models in the general case (see \begin_inset LatexCommand \cite{charniak93statistical,ewens01statistical} \end_inset ). (Note that the Markov(k-1) models dependency among k elements: the previous k-1 and current one, hence the correspondence with NB(k) and not with NB(k-1)) \layout Standard The probability distributions represented by the undirected graphical models are the same as the one of the directed one as mentioned in the Proof Idea of the Junction Tree Theorem. More exactly we have shown that \begin_inset Formula $P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])$ \end_inset can be written either as in equation ( \begin_inset LatexCommand \ref{eq:pnbk} \end_inset ) which is the formula for the undirected model or as in equation ( \begin_inset LatexCommand \ref{eq:pmmk} \end_inset ) which is the formula for the directed / Bayesian network / Markov(k-1) model. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename sNBkBN.eps \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Bayesian-Network-(Markov} \end_inset Bayesian Network (Markov Model) representation of the k-wise dependency. a) shows the Naive Bayes = independence model; b) shows pairwise dependence and c) shows 3-wise dependence \end_inset \layout Standard We may think that Markov Models, since they are time oriented probabilistic models, might be less suited for modelling sequences that do not have this temporal dimension but where the dimension along which the sequence spans of is rather spatial than temporal nature (as in the case of protein sequences vs. speech utterances for example). As a consequence a need for using undirected graphical models may be felt. We have proved however [see Proof Idea] that the Markov(k-1) Models (temporally oriented) and the Markov Networks / Random Fields (undirected - spatially based) / NB(k) models produce the same probability distribution. Hence, which type of model to choose becomes more of a matter of preference or taste rather than one of choosing between models that yield different probabilities. \layout Subsection Training, i.e., Probability Estimation from Data \begin_inset LatexCommand \label{sub:Training} \end_inset \layout Standard In order to train our models we need to get estimates from the probabilities \begin_inset Formula $P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})$ \end_inset for each possible class value \begin_inset Formula $c_{j}$ \end_inset . Because of the stationarity assumption these probabilities are going to be the same for each \begin_inset Formula $i$ \end_inset so \begin_inset Formula $P(S_{1}=s_{1},...,S_{k}=s_{k}|c_{j})=P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})$ \end_inset for all values of \begin_inset Formula $i$ \end_inset . We will denote their common value for a set of elements \begin_inset Formula $s_{1},...,s_{k}$ \end_inset and a class \begin_inset Formula $c_{j}$ \end_inset by notation abuse with \begin_inset Formula $P(s_{1},...,s_{k}|c_{j})$ \end_inset . We will estimate this quantity in the standard way as follows: \begin_inset Formula \[ P(s_{1},...,s_{k}|c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(c_{j})}\] \end_inset Where \begin_inset Formula $counts(s_{1},...,s_{k},c_{j})$ \end_inset are the number of times the elements \begin_inset Formula $s_{1},...,s_{k}$ \end_inset appear in a sequence of belonging to class \begin_inset Formula $c_{j}$ \end_inset at any position and \begin_inset Formula $counts(c_{j})$ \end_inset is how many such k-grams exist within sequences from class \begin_inset Formula $c_{j}$ \end_inset . \layout Standard In the case of directed models the estimation can be done in a similar way namely by the following formula: \layout Standard \begin_inset Formula \[ P(s_{1}|s_{2},...,s_{k},c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(s_{2},...,s_{k},c_{j})}\] \end_inset \layout Standard Now that we have showed how to fit the parameters of the model we are ready to test it experimentally. \layout Standard Furthermore in order to avoid zero probabilities and to get a mild regularisatio n effect we used the Laplace correction for the previous probability estimates (Which basically initialises the counts with 1 instead of 0 and adds to the denominator an appropriate count in order to ensure that the probabilities sum up to 1) . \layout Section Experimental setup and results \layout Standard \begin_inset Note collapsed true \layout Standard We will test the two classification algorithms associated with the probabilistic models derived in Sections \begin_inset LatexCommand \ref{sub:UNPkgP-&-NB} \end_inset and \begin_inset LatexCommand \ref{sub:NB(k)} \end_inset against Naive Bayes. More precisely we will test Naive Bayes, NB k-grams and NB(k) on three protein function classification datasets. \layout Standard \series bold Datasets: \series default The three Datasets are as follows: The first one Kinase, contains 5 classes of Kinases amounting to a total of 396 proteins with the following per class count (102, 209, 69, 10, 6); The second one Kinase/Isomerase aims to discriminate between Kinases and Ligases (i.e., a two class problem) and contains a total of 376 protein sequences with a per class count of (158, 218); The third dataset Kinase/Ligase/Helicase/Isomerase contains four classes of proteins: Kinases, Ligases, Helicases and Isomerases respectively, amounting to 572 proteins with a per class count of (158, 218, 110, 86). All the protein functions have been Extracted from GO (the Gene Ontology, \begin_inset LatexCommand \cite{gene00gene} \end_inset ) and the corresponding protein sequences were extracted from SWISSPROT \begin_inset LatexCommand \cite{bairoch00swiss-prot} \end_inset . The datasets are available for download at \begin_inset LatexCommand \cite{andorf04pcdatasets} \end_inset . \layout Standard \series bold Experiments and Results: \series default The computational experiments were motivated by the following questions: \layout Standard How do the two presented methods NB k-grams and NB(k) compare among each other and against the baseline represented by Naive Bayes(i.e., NB(1), a.k.a. NB 1-grams)? \layout Standard Furthermore we inquire upon the effect that the number of dependencies, k, has on the classification accuracy of the dependency ? \layout Standard We ran each algorithm with the size of dependency k sizes ranging from 1 to 4 (except for Naive Bayes, of course). For sizes of k larger than 4, e.g., for 5 we need to calculate over three million probabilities (= \begin_inset Formula $20^{5}$ \end_inset ) per class the size of the data doe not allow us to reliably estimate these probabilities. The accuracy estimates in all the tables are based on stratified 10-fold cross validation. \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \SpecialChar ~ k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB k-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 81.3 \end_inset \begin_inset Text \layout Standard 88.6 \end_inset \begin_inset Text \layout Standard 7.3 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 89.9 \end_inset \begin_inset Text \layout Standard 92.7 \end_inset \begin_inset Text \layout Standard 2.8 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 90.4 \end_inset \begin_inset Text \layout Standard 91.6 \end_inset \begin_inset Text \layout Standard 1.6 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Kinase} \end_inset Kinase dataset results. \end_inset \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \SpecialChar ~ k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB on k-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 83.5 \end_inset \begin_inset Text \layout Standard 84.6 \end_inset \begin_inset Text \layout Standard 1.1 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 84.0 \end_inset \begin_inset Text \layout Standard 85.6 \end_inset \begin_inset Text \layout Standard 1.6 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 85.9 \end_inset \begin_inset Text \layout Standard 90.7 \end_inset \begin_inset Text \layout Standard 4.8 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:K_L} \end_inset Kinase / Ligase dataset results. \end_inset \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \series bold \SpecialChar ~ \series default k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB on [k+1]-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 70.3 \end_inset \begin_inset Text \layout Standard 72.2 \end_inset \begin_inset Text \layout Standard 1.9 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 79.5 \end_inset \begin_inset Text \layout Standard 80.8 \end_inset \begin_inset Text \layout Standard 1.3 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 79.4 \end_inset \begin_inset Text \layout Standard 82.0 \end_inset \begin_inset Text \layout Standard 2.6 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:K_L_H_I} \end_inset Kinase / Ligase / Helicase / Isomerase dataset results. \end_inset \layout Standard The experimental results show the superiority of both NB k-grams and NB(k) over Naive Bayes on all test cases. Furthermore NB(k) ended up being superior in accuracy to NB k-grams on all test case. In the case of the Kinase dataset for NB(4) while still doing better than NB 4-grams did worse that NB(3). This result is probably due to an insufficient amount of data available for the estimation of the probabilities. \end_inset \layout Standard We will test the two classification algorithms associated with the probabilistic models derived in Sections \begin_inset LatexCommand \ref{sub:UNPkgP-&-NB} \end_inset and \begin_inset LatexCommand \ref{sub:NB(k)} \end_inset against Naive Bayes. More precisely we will test Naive Bayes, NB k-grams and NB(k) on three protein function classification datasets. \layout Subsection Data Sets \layout Standard The first dataset is based on families of yeast and human kinases. These families were chosen for this study because many of them are well-charact erized, with known structures and functions. The data set used in this study consisted of proteins belonging to the Gene Ontology, functional family GO0004672, Protein Kinase Activity. We classified them according to the highest level below GO0004672. This consists of 5 groups. In GO their labels are GO0004672, Protein Kinase Activity; GO0004674, protein serine/threonine kinase activity; GO0004713, protein-tyrosine kinase activity; GO0004712, protein threonine/tyrosine kinase activity; and GO0004716, receptor signaling protein tyrosine kinase activity. These families represent all the yeast and human proteins that can be extracted from SWISSPROT that have the given GO classification. Since GO is represented as a directed acyclic graph, some proteins may be represented in multiple classifications. We filter the data to remove any multi-labeled proteins to guarantee disjoint classes. This dataset includes a total of 396 proteins: 102 proteins from GO0004672; 209 proteins from GO0004674, 69 proteins from GO0004713, 10 proteins from GO0004712, and 6 proteins from GO0004716. \layout Standard The second dataset is based on two subfamilies of GO0003824, Catalytic Activity. This division is at a higher level of GO then the previous dataset. The data set used in this study consisted of proteins belonging to the Gene Ontology functional family GO0004672, Protein Kinase Activity and GO001684, Protein Ligase Activity. These families represent all the yeast proteins that can be extracted from SWISSPROT that have the given GO classification.. This data includes a total 376 proteins: 158 proteins from Kinase, and 218 proteins from Ligase. \layout Standard The third dataset is a super set of dataset two. It contains the Kinase data and Ligase data in addition to two other subfamilie s of GO0003824, Catalytic Activity. These families are GO0004386, Protein Helicase Activity, and GO0016853, Protein Isomerase Activity. This dataset tests the classifiers ability on a larger number of classes at a high level of GO. This data includes a total of 572 proteins: 158 proteins from Kinase, 218 proteins from Ligase, 110 proteins from Helicase, and 86 proteins from Isomerase. \layout Standard [ \series bold Shorter version \series default ] \series bold Datasets: \series default The three Datasets are as follows: The first one Kinase, contains 5 classes of Kinases amounting to a total of 396 proteins with the following per class count (102, 209, 69, 10, 6); The second one Kinase/Isomerase aims to discriminate between Kinases and Ligases (i.e., a two class problem) and contains a total of 376 protein sequences with a per class count of (158, 218); The third dataset Kinase/Ligase/Helicase/Isomerase contains four classes of proteins: Kinases, Ligases, Helicases and Isomerases respectively, amounting to 572 proteins with a per class count of (158, 218, 110, 86). All the protein functions have been Extracted from GO (the Gene Ontology, \begin_inset LatexCommand \cite{gene00gene} \end_inset ) and the corresponding protein sequences were extracted from SWISSPROT \begin_inset LatexCommand \cite{bairoch00swiss-prot} \end_inset . The datasets are available for download at \begin_inset LatexCommand \cite{andorf04pcdatasets} \end_inset . \layout Subsection Experiments and Results \layout Standard The computational experiments were motivated by the following questions: \layout Standard How do the two presented methods NB k-grams and NB(k) compare among each other and against the baseline represented by Naive Bayes(i.e., NB(1), a.k.a. NB 1-grams)? \layout Standard Furthermore we inquire upon the effect that the number of dependencies, k, has on the classification accuracy of the dependency ? \layout Standard Tables \begin_inset LatexCommand \ref{cap:Kinase} \end_inset , \begin_inset LatexCommand \ref{cap:K_L} \end_inset and \begin_inset LatexCommand \ref{cap:K_L_H_I} \end_inset show the results of our experiments using the three datasets. For each dataset we ran the three separate algorithms Naive Bayes, (i.e., NB(1) a.k.a. NB 1-grams), Naive Bayes with k-grams, and NB(k). \layout Standard We ran each algorithm with the size of dependency k sizes ranging from 1 to 4 (except for Naive Bayes, of course). For sizes of k larger than 4, e.g., for 5 we need to calculate over three million probabilities (= \begin_inset Formula $20^{5}$ \end_inset ) per class; and we do not have enough data to reliably estimate these probabili ties. Therefore all of our data gets washed out by prior probabilities given by the Laplace correction and everything is classified as belonging to the largest class, the class with the highest probability of occurring. For our alphabet we chose the standard 20 letter amino acid alphabet (although using a smaller alphabet will decrease the complexity and therefore the runtime of these algorithms, it offers no benefit in terms of accuracy, neither of which is our ultimate goal, so we exclude these results \begin_inset LatexCommand \cite{andorf02discovering} \end_inset . Note that the reduced alphabets are obtained by grouping some amino acids under the same symbol and treating the as one type of entity). As already hinted all of our probability estimates for all methods used Laplace correction. The accuracy estimates in all the tables are based on stratified 10-fold cross validation. \layout Standard The tables \begin_inset LatexCommand \ref{cap:Kinase} \end_inset , \begin_inset LatexCommand \ref{cap:K_L} \end_inset and \begin_inset LatexCommand \ref{cap:K_L_H_I} \end_inset show from left to right: the dependency number k (k-wise dependency); the accuracy for Naive Bayes, the accuracy for Naive Bayes k-grams, the accuracy for NB(k).,and the improvement in accuracy of NB(k) over NB k-grams \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \SpecialChar ~ k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB k-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 66.1 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 81.3 \end_inset \begin_inset Text \layout Standard 88.6 \end_inset \begin_inset Text \layout Standard 7.3 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 89.9 \end_inset \begin_inset Text \layout Standard 92.7 \end_inset \begin_inset Text \layout Standard 2.8 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 90.4 \end_inset \begin_inset Text \layout Standard 91.6 \end_inset \begin_inset Text \layout Standard 1.6 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:Kinase} \end_inset Kinase (GO0004672) dataset results. There were a total of 396 proteins and five classes. k refers to the number of dependencies used in each algorithm. NB shows the accuracy of running Naive Bayes on the dataset. NB k-grams shows the accuracy of running Naive Bayes using k-grams on the dataset. NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies on the dataset. \end_inset . \layout Standard [ \series bold Can Be omitted in a short paper \series default ] Table \begin_inset LatexCommand \ref{cap:Kinase} \end_inset shows the results from the Kinase dataset. Using Naive Bayes alone we were able to get a 66% classification rate. By increasing our value of k to 2 we were able to increase accuracy to 81.3% for NB 2-grams and 88.6% with NB(2). For NB(2) this was over a 22% improvement over Naive Bayes and over 7% improvement over NB 2-grams. By increasing k to 3, we again see an improvement. NB 3-grams and NB(3) improved to 89.9% and 92.7% respectively. When we increase k again to 4 we get little improvement on this dataset. NB 4-grams improved by only 0.5% and NB(4) while still doing better than NB 4-grams actually decreased in accuracy over in NB(3). The main reason here is the poor estimates for the probability distribution. We are estimating \begin_inset Formula $20^{4}~160,000$ \end_inset probabilities per class. And we have 5 classes. \layout Standard Table \begin_inset LatexCommand \ref{cap:K_L} \end_inset , Kinase/Ligase dataset and Table \begin_inset LatexCommand \ref{cap:K_L_H_I} \end_inset , Kinase/Ligase/Isomerase/Helicase dataset show similar results. When increasing k from 2 to 3 we get a dramatic improvement with NB(3) over NB(2), over 5% in Table \begin_inset LatexCommand \ref{cap:K_L} \end_inset and nearly 15% in Table \begin_inset LatexCommand \ref{cap:K_L_H_I} \end_inset . As we increase k from 2 to 4 we again see improvement in both datasets. We can get up to a 90.7% accuracy in Table \begin_inset LatexCommand \ref{cap:K_L} \end_inset and 82% accuracy in Table \begin_inset LatexCommand \ref{cap:K_L_H_I} \end_inset . Another important observation is that NB(k) always outperformed NB k-grams. It is worth noting that both of these experiments used the same splits when doing cross validation and used the same way (Laplace correction) to estimate the probabilities when building classifiers. So there is no biased in terms of how these two algorithms were trained. \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \SpecialChar ~ k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB k-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 77.9 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 83.5 \end_inset \begin_inset Text \layout Standard 84.6 \end_inset \begin_inset Text \layout Standard 1.1 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 84.0 \end_inset \begin_inset Text \layout Standard 85.6 \end_inset \begin_inset Text \layout Standard 1.6 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 85.9 \end_inset \begin_inset Text \layout Standard 90.7 \end_inset \begin_inset Text \layout Standard 4.8 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:K_L} \end_inset Kinase(GO0004672) / Ligase(GO001684) dataset results. There were a total of 396 proteins and two classes. K refers to the number of dependencies used in each algorithm. NB shows the accuracy of running Naive Bayes on the dataset. NB k-grams shows the accuracy of running Naive Bayes using k-grams on the dataset. NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies on the dataset. \end_inset \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \series bold \SpecialChar ~ \series default k \end_inset \begin_inset Text \layout Standard NB \end_inset \begin_inset Text \layout Standard NB k-grams \end_inset \begin_inset Text \layout Standard NB(k) [<=>MM(k-1)] \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard 1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 56.1 \end_inset \begin_inset Text \layout Standard 0.0 \end_inset \begin_inset Text \layout Standard 2 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 70.3 \end_inset \begin_inset Text \layout Standard 72.2 \end_inset \begin_inset Text \layout Standard 1.9 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 79.5 \end_inset \begin_inset Text \layout Standard 80.8 \end_inset \begin_inset Text \layout Standard 1.3 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard X \end_inset \begin_inset Text \layout Standard 79.4 \end_inset \begin_inset Text \layout Standard 82.0 \end_inset \begin_inset Text \layout Standard 2.6 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:K_L_H_I} \end_inset Kinase(GO0004672) / Ligase(GO001684) / Helicase Activity(GO0004386) / Isomerase (GO0016853) dataset results. There was a total of 572 proteins and four classes. K refers to the number of dependencies used in each algorithm. NB shows the accuracy of running Naive Bayes on the dataset. NB k-grams shows the accuracy of running Naive Bayes using k-grams on the dataset. NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies on the dataset. \end_inset \layout Standard [To recapitulate] The experimental results show the superiority of both NB k-grams and NB(k) over Naive Bayes on all test cases. Furthermore NB(k) ended up being superior in accuracy to NB k-grams on all test case. In the case of the Kinase dataset for NB(4) while still doing better than NB 4-grams did worse that NB(3). This result is probably due to an insufficient amount of data available for the estimation of the probabilities. \layout Standard Table \begin_inset LatexCommand \ref{cap:best-compared} \end_inset reports the best k size for each of the two methods. In this table model size refers to the overall alphabet size used by the algorithm or equivalently how many probabilities per class we estimate. It is computed by taking the alphabet size and raising it to the k-th power. For example an alphabet size of 20, our amino acid alphabet, with a k size of 4 would have a \begin_inset Formula $20^{4}$ \end_inset = 160,000 letter alphabet. This is also, again the number of probabilities per class needed to be estimated for each algorithm. Note that in the case of Naive Bayes we estimate 20 probabilities per class. Also note that the estimations of these big probabilities are very fast as long as the table fits into memory (it involves one addition and one multiplication per entry into the table). Table \begin_inset LatexCommand \ref{cap:best-compared} \end_inset shows that k equal to 3 or 4 is usually optimal in terms of classification. \layout Standard \begin_inset Float table wide false collapsed false \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard NB k-grams \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard NB(k) \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard DataSet \end_inset \begin_inset Text \layout Standard k \end_inset \begin_inset Text \layout Standard Model Size \end_inset \begin_inset Text \layout Standard Acc. \end_inset \begin_inset Text \layout Standard k \end_inset \begin_inset Text \layout Standard Model Size \end_inset \begin_inset Text \layout Standard Acc. \end_inset \begin_inset Text \layout Standard Improvement \end_inset \begin_inset Text \layout Standard Kinase \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard 160,000 \end_inset \begin_inset Text \layout Standard 90.4 \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard 8000 \end_inset \begin_inset Text \layout Standard 92.7 \end_inset \begin_inset Text \layout Standard 2.3 \end_inset \begin_inset Text \layout Standard K/L \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard 160,000 \end_inset \begin_inset Text \layout Standard 85.9 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard 160,000 \end_inset \begin_inset Text \layout Standard 90.7 \end_inset \begin_inset Text \layout Standard 4.8 \end_inset \begin_inset Text \layout Standard K/L/H/I \end_inset \begin_inset Text \layout Standard 3 \end_inset \begin_inset Text \layout Standard 8000 \end_inset \begin_inset Text \layout Standard 79.5 \end_inset \begin_inset Text \layout Standard 4 \end_inset \begin_inset Text \layout Standard 160,000 \end_inset \begin_inset Text \layout Standard 82.0 \end_inset \begin_inset Text \layout Standard 2.5 \end_inset \end_inset \layout Caption \begin_inset LatexCommand \label{cap:best-compared} \end_inset A comparison of the best accuracy for each dataset using both the Naive Bayes with k-grams and NB(k) methods. k refers to the number of dependencies modelled by each algorithm. Model size refers to the total number of probabilities per class computed by each algorithm for the given value of k. Acc. refers to the accuracy. For each algorithm we show the value k for which it obtained the best accuracy on each dataset. Improvement shows the percent improvement that NB(k) had over NB with k-grams. \end_inset \layout Section Summary and Discussion \layout Standard To recapitulate, we presented two methods that relax the independence assumption of Naive Bayes by modelling interaction among close neighbours in the sequence. The two methods model k-wise dependency among successive elements in sequence data. Despite the modeling accuracy gain, the two proposed algorithms maintain the \begin_inset Quotes eld \end_inset one pass through the data only \begin_inset Quotes erd \end_inset training property of Naive Bayes. \layout Standard We put to test our methods on three protein function prediction problems with functional classes extracted from GO (the Gene Ontology, \begin_inset LatexCommand \cite{gene00gene} \end_inset and corresponding sequences extracted from SWISSPROT \begin_inset LatexCommand \cite{bairoch00swiss-prot} \end_inset . The proposed methods compared favorably with Naive Bayes. Furthermore NB(k) showed (as expected) the best performance in terms of accuracy on all test cases (over NB k-grams). NB(k) is a natural generalization of Naive Bayes (for k=1) and it constitutes and end in itself. That is, under the stationarity assumption it is the most complete and correct probabilistic model of k-wise interaction among k successive elements of the sequence. \layout Standard The improved modelling accuracy and also the demonstrated improvement in test accuracy for increased values of k need to be taken with a grain of salt. That is they will hold, under the proviso that we have enough data to fit the model probabilities. \layout Standard We have also shown the equivalence of the NB(k) undirected model with the Markov Model of order k-1. This contributes to dispelling a misconception that undirected models (i.e., random fields) are more suitable than directed models such as Markov Models for modelling protein sequences since the order in the protein sequence has a spatial nature rather than a temporal one as it is the case with Markov Models. While we do not dispute the spatial nature of protein sequences, we show that both the undirected (spatial = NB(k) ) and the directed models (temporal = Markov(k-1) Models) associated with with k-wise dependency yield the same probability distribution. Additionally NB(k), which is not incidentally also our best performer, represents and end in itself since it is the most complete undirected graphical model for k-grams modelling. \layout Standard Some directions for future work include: Studying the effect of varying the alphabet size on the resulting accuracy.More in-depth examination of the results, addressing questions such as: Are NB(k) picking out probabilities similar to those of multiple sequence alignment motifs? A more in-depth study using a wider range of datasets such as Intrusion Detection, ... . \layout Standard \begin_inset LatexCommand \BibTeX[plain]{D:/Users/docs/lyx/_bibliography/bib} \end_inset \the_end