#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass article
\language english
\inputencoding auto
\fontscheme times
\graphics default
\paperfontsize 10
\spacing single
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default
\layout Title
Inter-element dependency models for sequence classification
\layout Author
Adrian Silvescu, Carson Andorf, Drena Dobbs, Vasant Honavar
\layout Date
\SpecialChar ~
\layout Standard
Artificial Intelligence Laboratory
\layout Standard
Department of Computer Science and
\layout Standard
Graduate Program in Bioinformatics and Computational Biology
\layout Standard
Iowa State University
\layout Standard
Ames, IA 50011, USA
\layout Standard
{silvescu|andorfc|ddobs|honavar}@iastate.edu
\layout Abstract
Naive Bayes is a fast to train model for sequence classification.
We develop and experiment with two methods that are equally fast (they
require only one pass through the training data), but they are also able
to models interactions among close neighbours in the sequence (unlike the
Naive bayes independence assumption).
The first method basically runs Naive Bayes on overlapping k-grams obtained
from the sequence.
The second method, called NB(k), constructs a classifier associated with
an undirected graphical model over the sequence that takes into account
k-wise dependencies.
We test our algorithms on protein function classification tasks based on
functional families from the Gene Ontology (GO) database.
Our results show significant improvements of the proposed methods over
Naive Bayes with NB(k) leading over NB k-grams in all test cases.
The two proposed methods despite having improved modelling accuracy and
demonstrated improved test accuracy maintain the
\begin_inset Quotes eld
\end_inset
one pass through the data only
\begin_inset Quotes erd
\end_inset
training property of Naive Bayes thus yielding fairly accurate and efficient
classifiers.
\layout Section
Introduction
\layout Standard
Sequence classification is a task that appears in many natural scenarios
such as protein sequence function prediction, intrusion detection, text
classification, etc.
\layout Standard
A simple solution that offers a quick and practical approach to this problem
is Naive Bayes.
Naive Bayes assumes that the elements in the sequence are independent of
each other given the class.
This independence assumption however seldom holds in practice.
Furthermore most of the time the dependencies are very strong among close
neighbours in the sequence.
In order to address this concern we would like to be able to relax the
strong independence assumption of Naive Bayes and allow for a limited amount
of interaction among close neighbours, and thus get better modelling accuracy.
Despite the additional sophistication we would like to maintain the
\begin_inset Quotes eld
\end_inset
one pass through the data only
\begin_inset Quotes erd
\end_inset
training property of Naive Bayes.
Note that since we are preoccupied with sequence classification whenever
we talk about Naive Bayes we actually refer to the so called Naive Bayes
Multinomial which is more appropriate for our case (sequence classification)
as opposed to the Multivariate model (see
\begin_inset LatexCommand \cite{mccallum98comparison}
\end_inset
for a discussion about it).
\layout Standard
We are going to provide two methods to deal with this problem.
The first one basically splits the sequence into a sequence of overlapping
k-grams and runs a Naive Bayes algorithm in the transformed space.
We call the resulting algorithm NB k-grams.
This method however violates the assumption of independence that Naive
Bayes holds in an obvious way: Since the k-grams are overlapping, every
two subsequent k-grams share k-1 elements, which makes them evidently non-indep
endent.
\layout Standard
In the second method, in order to deal with the above mentioned problem
properly we construct an undirected graphical model for the k-grams
\begin_inset LatexCommand \cite{charniak93statistical}
\end_inset
, which will adequately discount the contributions of the overlaps.
We show the equivalence of this undirected model with the Markov Model
of order k-1.
This contributes to dispelling a misconception that undirected models (i.e.,
random fields) are more suitable than directed models such as Markov Models
for modelling protein sequences.
This view may be held based on the fact that the order in the protein sequence
has a spatial nature, which is better modelled by undirected models (random
fields), rather than a temporal nature as it is the case with Markov Models.
While we do not dispute the spatial nature of the oreder in protein sequences,
we show that both the undirected (spatial) and the directed models (temporal
= Markov Models) associated with with k-grams yield the same probability
distribution.
\layout Standard
In order to use our our undirected models for classification, we apply the
basic trick of turning a generative model into a predictor: We train one
generative model per class and then when asked to predict a class for a
newly seen sequence we output the one whose corresponding generative model
outputs the highest probability for the given sequence, weighted with the
probability of the class.
We call this model NB(k) from Naive Bayes order k (which is when we model
dependencies among k attributes).
Note that Naive Bayes will turn out to be NB(1) (we model dependencies
among 1 attribute, that is no dependencies, i.e., the independence)
\layout Standard
We put to test our methods on three protein function prediction problems
with functional classes extracted from GO (the Gene Ontology
\begin_inset LatexCommand \cite{gene00gene}
\end_inset
) and corresponding sequences extracted from SWISSPROT
\begin_inset LatexCommand \cite{bairoch00swiss-prot}
\end_inset
.
In our experiments NB k-grams outperformed Naive Bayes on all the tests
and NB(k) outperformed NB k-grams on all tests too.
\layout Standard
To recapitulate, we will present two methods that relax the independence
assumption of Naive Bayes by modelling interaction among close neighbours
in the sequence.
The two methods preserve the
\begin_inset Quotes eld
\end_inset
one pass through the data
\begin_inset Quotes erd
\end_inset
training property of the Naive Bayes and they are also also easily update-able.
Additionally our second method - NB(k) which is not incidentally, also
our best performer, represents and end in itself since it is the most complete
undirected graphical model for k-grams modelling.
We further explore the equivalence of this undirected model to the (directed)
Markov Model of order k-1 thus clearing a misconception about the presumed
superiority of undirected graphical models vs.
the directed ones in terms of modelling sequences where the order is spatial
as opposed to temporal.
\layout Standard
The rest of the paper will proceed as follows: First, in the next section
we will have an in depth exploration of the intuition and theory behind
the two methods to be explored.
Then, we put them to test in section three and produce some experimental
results.
Finally, in section four we conclude with a summary and discussion.
\layout Section
Method
\layout Standard
In this section we will preoccupy ourselves with unraveling the details
of the new models.
As a basic strategy we will first start from basic principles and try to
develop some intuition behind the new methods followed by more precise
and general formulations, which will be furthermore supplemented by theoretical
results.
\layout Standard
We start in the first subsection with an outline of the Naive Bayes Classifier
and the general method for producing classifiers from generative models.
Then we examine a way to produce undirected probabilistic models that take
into account correlations among proximal elements within a sequence.
Subsequently, we outline the associated directed graphical models and comment
on the relation with the directed ones.
We then briefly present the parameter fitting method.
\layout Subsection
Naive Bayes and the General Method for producing classifiers from generative
models.
\layout Standard
Before plunging into the details, a little bit of
\series bold
Notation
\series default
: We will denote a whole sequence with
\begin_inset Formula $\overline{S}$
\end_inset
, the i-th element of the sequence by
\begin_inset Formula $S_{i}$
\end_inset
and the value of the i-th element in the sequence by
\begin_inset Formula $s_{i}$
\end_inset
, furthermore let
\begin_inset Formula $V$
\end_inset
be the set of all possible values that the elements in the sequence can
take.
Let
\begin_inset Formula C
\end_inset
be the class variable and let
\begin_inset Formula $\{ c_{j}\}_{j=1,m}$
\end_inset
be the possible values that C can take.
And let
\begin_inset Formula $n$
\end_inset
be the length of the current sequence i.e.,
\begin_inset Formula $\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}]$
\end_inset
.
\layout Standard
The Naive Bayes algorithm for sequence classification assumes that the successiv
e elements in a given sequence are independent of each other given the class.
\layout Standard
One way to interpret Naive Bayes is as a generative model.
A generative model produces the observed data by the means of a probabilistic
generation process.
The generation process for Naive Bayes is very simple:
\layout Standard
\begin_inset Float algorithm
placement H
wide false
collapsed false
\layout Standard
1.
First generate a class
\begin_inset Formula $c_{j}$
\end_inset
according to the probability P(C)
\layout Standard
2.
Then for i = 1 to n
\layout Standard
\begin_inset Formula $\qquad$
\end_inset
generate
\begin_inset Formula $s_{i}$
\end_inset
according to P(
\begin_inset Formula $S_{i}$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
).
\layout Standard
\SpecialChar ~
\layout Standard
Note: Since we also assume stationarity P(
\begin_inset Formula $S_{i}$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) = P(
\begin_inset Formula $S_{l}$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) := P(
\begin_inset Formula $S$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) for every
\begin_inset Formula $i,l\in\{1,...,n\}$
\end_inset
and
\begin_inset Formula $c_{j}\in C$
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Naive-Bayes-Sequence}
\end_inset
Naive Bayes Sequence Generative Model
\end_inset
\layout Standard
Note that in step 2 we are generating each element in the sequence independent
of the previous ones or the ones that are going to come.
The generative model implemented by the Algorithm
\begin_inset LatexCommand \ref{cap:Naive-Bayes-Sequence}
\end_inset
can be depicted as a Bayesian Network as shown in Figure
\begin_inset LatexCommand \ref{cap:Naive-Bayes-Bayesian}
\end_inset
.
In this figure the arrows from the class variable to the elements of the
sequence show the dependencies in the generative process.
\layout Standard
\begin_inset Float figure
wide false
collapsed false
\layout Standard
\begin_inset Graphics
filename NB.eps
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Naive-Bayes-Bayesian}
\end_inset
Naive Bayes Bayesian Network
\end_inset
\layout Standard
The Naive Bayes classifier is a particular example of a general scheme of
producing a classifier for a given generative model for the input data
(sequence in our case).
We will try to outline it in what follows with the purpose of using it
with other more complicated generative models.
\layout Standard
Let us assume we have a probabilistic generative model for sequences
\begin_inset Formula $GM$
\end_inset
.
That is a probabilistic model which given a sequence
\begin_inset Formula $\overline{S}=s_{1},...,s_{n}$
\end_inset
can give us the probability
\begin_inset Formula $P_{GM}(\overline{S}=s_{1},...,s_{n})$
\end_inset
according to the generative model
\series bold
\emph on
\begin_inset Formula $GM$
\end_inset
\series default
\emph default
.
Then if we want to obtain a classifier using the generative model
\begin_inset Formula $GM$
\end_inset
we can use the following (standard) procedure:
\layout Standard
1.
For each class
\begin_inset Formula $c_{j}$
\end_inset
train a generative model
\begin_inset Formula $GM(c_{j})$
\end_inset
using the data from class
\begin_inset Formula $c_{j}$
\end_inset
.
\layout Standard
2.
Then when asked to classify a new sequence
\begin_inset Formula $\overline{S}=s_{1},...,s_{n}$
\end_inset
predict it as belonging to class:
\layout Standard
\begin_inset Formula \[
classification=\arg\max_{c_{j}\in C}\: P_{GM(c_{j})}(\overline{S}=s_{1},...,s_{n})\: P(c_{j})\]
\end_inset
\layout Standard
or equivalently:
\layout Standard
\begin_inset Formula \[
classification=\arg\max_{c_{j}\in C}\: P(\overline{S}=s_{1},...,s_{n}|GM(c_{j}))\: P(c_{j})\]
\end_inset
\layout Standard
or equivalently if, by abuse of notation,
\begin_inset Formula $P(\overline{S}=s_{1},...,s_{n}|c_{j})$
\end_inset
stands for
\begin_inset Formula $P(\overline{S}=s_{1},...,s_{n}|GM(c_{j}))$
\end_inset
we will have:
\layout Standard
\begin_inset Formula \[
classification=\arg\max_{c_{j}\in C}\: P(\overline{S}=s_{1},...,s_{n}|c_{j})\: P(c_{j})\]
\end_inset
\layout Standard
which in the case of the Naive Bayes generative model is the well known:
\layout Standard
\begin_inset Formula \[
classification=\arg\max_{c_{j}\in C}\: P(S_{i}=s_{i}|c_{j})\: P(c_{j})\]
\end_inset
\layout Standard
because under the independence assumption given the class we have:
\begin_inset Formula \[
P(\overline{S}=s_{1},...,s_{n}|c_{j})=\prod_{i=1}^{n}P(S_{i}=s_{i}|c_{j})\]
\end_inset
\layout Standard
Note that in the case of Naive Bayes the generative model for the input
data is assumed to be a set of independent samples drawn from the same
multinomial distribution
\begin_inset Formula $P(S_{i}=s_{i}|c_{j})$
\end_inset
(so the joint
\begin_inset Formula $P(S=s_{1},...,s_{n}|c_{j})$
\end_inset
can be obtained by taking the product).
Furthermore under the stationarity assumption the probability distribution
does not depends on the position
\begin_inset Formula $i$
\end_inset
(i.e., P(
\begin_inset Formula $S_{i}$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) = P(
\begin_inset Formula $S_{l}$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) = P(
\begin_inset Formula $S$
\end_inset
|
\begin_inset Formula $C=c_{j}$
\end_inset
) for every
\begin_inset Formula $i,l\in\{1,...,n\}$
\end_inset
and
\begin_inset Formula $c_{j}\in C$
\end_inset
).
\layout Standard
We are interested however in capturing more interactions among the elements
in the sequence, as they are seldom independent and as such, we will need
a more complicated model for the data.
Given that this model is established, we can then produce a classifier
using the scheme outlined above.
So from now on we will forget about classes and we will only be concerned
with finding probabilistic models that take into account dependencies among
elements in the sequence.
And then whenever we shall be asked produce classifications with our derived
a probabilistic model
\begin_inset Formula $PM$
\end_inset
we will apply the schema outlined in this section by replacing in the formulas
\begin_inset Formula $P_{GM(c_{j})}(...)$
\end_inset
with
\begin_inset Formula $P_{PM(c_{j})}(...)$
\end_inset
.
\layout Standard
Now that we have outlined the general scheme for producing classifiers from
Probabilistic Models we reduced our problem to the one of finding some
appropriate probabilistic models that capture correlations among close
elements within a sequence.
Upon finding such models we can turn them into a classifier using the scheme
outlined above.
Hence, classification being set aside, we are ready to launch into the
exploration of dependency models among sequence elements.
\layout Subsection
Models for direct interaction among k consecutive elements
\layout Standard
In graphical depiction we show the interaction model for Naive Bayes in
Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
a) for a sequence of five elements.
What we would like now to model is dependencies among two or more attributes.
One way to represent these dependencies in a graphical form is by drawing
edges between the nodes that are supposed to be directly dependent on each
other.
The graph for pairwise dependencies is illustrated in Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
b) and the one for 3-wise dependency is depicted in Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
c).
\layout Standard
\begin_inset Float figure
placement t
wide false
collapsed false
\layout Standard
\begin_inset Graphics
filename sNBk.eps
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Graphical-depiction-of}
\end_inset
Graphical depiction of the dependence between the elements in a sequence
of five elements using Undirected Graphical Models: a) shows the Naive
Bayes [= independence model]; b) shows pairwise dependence and c) shows
3-wise dependence.
\end_inset
\layout Standard
The models show in Fugure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
are called Probabilistic Undirected Graphical Models / Markov Networks
/ Random Fields.
The arcs in the graphs denote the direct dependencies.
By complementarity the absence of an arc between two nodes denotes the
absence of a direct dependency.
This does not necessarily mean independence, and most often than not, it
isn't independence.
Two nodes are going to be indirectly dependent if they are going to be
connected in the graph by a path containing more than one edge.
The absence of an arc between two nodes means that there may be situations
where we if we know some of the values of nodes on paths connecting our
two nodes, this might yield the two nodes independent.
More exactly in order to get independence between two indirectly connected
nodes (connected by a path of length grater that one) we have to know the
value of at least one node on each of the paths between these two nodes.
The knowledge of the values of these nodes is going to
\begin_inset Quotes eld
\end_inset
block
\begin_inset Quotes erd
\end_inset
the indirect dependence between our two unconnected nodes, thus yielding
them independent.
\layout Standard
Now that we have explained the intuition behind the graphical representation
we are going to concern ourselves with a way of capitalising on this intuition
and try to derive a formula for the probability of a data-point (sequence
in our case).
That will allow us to provide
\begin_inset Formula $P_{PM}(\overline{S})$
\end_inset
which is all that is needed in order for our classifcation scheme to work.
In the independence assuming case ( Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
a) ) the formula for our 5 elements of the sequence is as follows:
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\prod_{i=1}^{5}P(S_{i}=s_{i})\]
\end_inset
\layout Standard
and in general for a sequence of
\begin_inset Formula $n$
\end_inset
elements we have:
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\prod_{i=1}^{n}P(S_{i}=s_{i})\]
\end_inset
\layout Subsubsection
\begin_inset LatexCommand \label{sub:UNPkgP-&-NB}
\end_inset
UNPkgP & NB k-grams
\layout Standard
One intuitive way to try to define a probability distribution for the case
of the graphical model representing pairwise dependencies ( Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
b) ) is to take the product of the marginals of the directly dependent
node as proposed by the following formula ( for our 5 elements sequence
from Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
b) ) [Since the nodes are pairwise dependent the information in the pairwise
marginals is presumably important and we cannot derive it from products
of single variable marginals]:
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])\;?=\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\]
\end_inset
\layout Standard
This will not produce a probability distribution however, and the intuition
behind it is that we are somehow
\begin_inset Quotes eld
\end_inset
double counting
\begin_inset Quotes erd
\end_inset
or more exactly
\begin_inset Quotes eld
\end_inset
double multiplying
\begin_inset Quotes erd
\end_inset
the influence of some of the nodes (such as
\begin_inset Formula $S_{2}=s_{2},S_{3}=s_{3},S_{4}=s_{4}$
\end_inset
;
\begin_inset Formula $S_{2}=s_{2}$
\end_inset
for example appears both in
\begin_inset Formula $P(S_{1}=s_{1},S_{2}=s_{2})$
\end_inset
and
\begin_inset Formula $P(S_{2}=s_{2},S_{3}=s_{3})$
\end_inset
).
\layout Standard
The are two ways to deal with this problem.
The first one is to ignore the double counting and in order to get a probabilit
y distribution just normalise it over all the possible values that the elements
in the sequence can take.
So in general we will have
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\frac{1}{Z}\prod_{i=1}^{n-1}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\]
\end_inset
\layout Standard
Where
\begin_inset Formula $Z$
\end_inset
is given by the following formula:
\layout Standard
\begin_inset Formula \[
Z=\sum_{S=(s_{1},...,s_{n})\in V^{n}}\prod_{i=1}^{n-1}P(S_{i}=s_{i},S_{i+1}=s_{i+1})\]
\end_inset
\layout Standard
We will call this model Probabilistic 2-gram Propositionalisation or shortly
P2gP.
More generally, for the case of k-wise direct dependencies we will call
this model Probabilistic k-gram Propositionalisation or shortly PkgP.
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])=\frac{1}{Z}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})\]
\end_inset
\layout Standard
and
\begin_inset Formula $Z$
\end_inset
being given by:
\layout Standard
\begin_inset Formula \[
Z=\sum_{S=s_{1},...,s_{n}\in V^{n}}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})\]
\end_inset
\layout Standard
Note that the normalisation constant
\begin_inset Formula $Z$
\end_inset
(a.k.a.
partition function) is a very complicated thing to calculate and with the
exception of the cases when we have very small sequences we cannot compute
it.
One possible way to deal with this problem is to just ignore it, or basically
set
\begin_inset Formula $Z$
\end_inset
to1 and use the corresponding results (which are not going to be true probabili
ties) instead of the true probability
\begin_inset Formula $p(\overline{S})$
\end_inset
when doing classification.
We will call this model UnNormalised Probabilistic k-gram Propositionalisation
or in short UNPkgP.
This model turns out, from the point of view of classification to be the
same as transforming the sequence into a bag of overlapping k-grams and
using the Naive Bayes classifier on it.
Here is a more precise statement of the claim and the prove for it.
\begin_inset Note
collapsed true
\layout Standard
(for a proof claim see
\begin_inset LatexCommand \cite{silvescu04inter-element}
\end_inset
).
\end_inset
\layout Standard
\series bold
Claim:
\series default
\emph on
The class conditional (classifier) version of UNPkgP [denoted C-UNPkgP]
yields the the same decision as Naive Bayes on k-grams [denoted NB k-grams].
\layout Standard
\emph on
Proof.
\layout Standard
\begin_inset Formula \[
C-UNPkgP=\arg\max_{c_{j}\in C}P_{C-UNPkgP(c_{j})}(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])P(c_{j})\]
\end_inset
\begin_inset Formula \begin{equation}
=\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P_{C-UNPkgP(c_{j})}(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})P(c_{j})\label{eq:claim1}\end{equation}
\end_inset
\layout Standard
\begin_inset Formula \begin{equation}
=\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})P(c_{j})\label{eq:claim2}\end{equation}
\end_inset
\layout Standard
\begin_inset Formula \[
=\arg\max_{c_{j}\in C}\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})P(c_{j})=NB\: k-grams\]
\end_inset
Where the transition from equation (
\begin_inset LatexCommand \ref{eq:claim1}
\end_inset
) to equation (
\begin_inset LatexCommand \ref{eq:claim2}
\end_inset
) is due to the fact that
\begin_inset Formula $P_{C-UNPkgP(c_{j})}(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})$
\end_inset
is estimated in the same way as
\begin_inset Formula $P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})$
\end_inset
from the training data , namely as:
\layout Standard
\begin_inset Formula \[
P(s_{1},...,s_{k}|c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(c_{j})}\]
\end_inset
\layout Standard
(see Section
\begin_inset LatexCommand \ref{sub:Training}
\end_inset
Training ...
for more details)
\begin_inset Formula $\Box$
\end_inset
\layout Standard
As a consequence of this claim we are going to us Naive Bayes applied to
k-grams as our first proposes method which we shoed to be identical in
classification results to turning or UNPkgP probabilistic model into a
classifier.
\layout Subsubsection
\begin_inset LatexCommand \label{sub:NB(k)}
\end_inset
NB(k)
\layout Standard
In a different vein however we can try to fix the problem of
\begin_inset Quotes eld
\end_inset
double multiplying
\begin_inset Quotes erd
\end_inset
by dividing by the probabilities associated with the double-counted nodes.
That is, for our 5 letter sequence from Figure
\begin_inset LatexCommand \ref{cap:Graphical-depiction-of}
\end_inset
b):
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])\;?=\frac{\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})}{\prod_{i=2}^{4}P(S_{i}=s_{i})}\]
\end_inset
\layout Standard
It turns out that this is the right thing to do and what we get is a full
blown probability distribution which is the
\begin_inset Quotes eld
\end_inset
correct
\begin_inset Quotes erd
\end_inset
one.
The generalisation for
\begin_inset Formula $k>2$
\end_inset
is as follows:
\layout Standard
\begin_inset Formula \begin{equation}
P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])\;=\frac{\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})}{\prod_{i=2}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-2}=s_{i+k-2})}\label{eq:nbkp}\end{equation}
\end_inset
\layout Standard
The obtained probability model is
\begin_inset Quotes eld
\end_inset
correct
\begin_inset Quotes erd
\end_inset
in the following sense: It is the same probability model that would be
obtained from considering a Markov Network / Random Field over the sequence,
whose graph is constructed by drawing direct arcs between all k consecutive
elements of the sequence.
Additionally the parameters for the marginal probabilities of each k consecutiv
e elements (the maximal cliques of these graphs) are shared for all positions.
We are producing in this case a position independent model (or equivalently
the probability distribution associated with it is stationary).
More exactly if we model dependence among k attributes we will call the
model NB(k) (from Naive Bayes of k).
Note that in this case NB(1) is really the independence case which yields
Naive Bayes.
\layout Standard
The fact that the probability distribution obtained in the previous formula
is correct is supported by the Junction Tree Theorem (see
\begin_inset LatexCommand \cite{cowell99probabilistic}
\end_inset
):
\layout Standard
\series bold
Theorem
\series default
\emph on
(Junction Tree): Let the Markov Network for a set of random variables be
described by a chordal graph G, then the probability distribution associated
with the Markov network can be written as the product of the marginals
of the maximal cliques divided by the product of the marginals of the separator
s.
i.e.,
\layout Standard
\begin_inset Formula \[
P(\overline{X}=[X_{1}=v_{1},...,X=v_{n}])\;=\frac{\prod_{c\in MaxCliques}P(X_{c}=x_{c})}{\prod_{s\in Separators}P(X_{s}=x_{s})}\]
\end_inset
\layout Standard
Before proceeding with a proof sketch a few potentially useful
\series bold
Definitions:
\layout Standard
Definition: (
\series bold
\emph on
elementary cycle
\series default
\emph default
) A cycle is called elementary if it contaitn no sub-cycles.
\layout Standard
Definition:(
\series bold
\emph on
chordal graph
\series default
\emph default
) A graph G is chordal iff it has no elementary cycles of length bigger
that 3.
\layout Standard
Definition:
\series bold
\emph on
(clique
\series default
\emph default
) A clique is a set of nodes from a graph G such that the is an edge between
every pair of nodes in the clique.
\layout Standard
Definition:(
\series bold
\emph on
maximal clique
\series default
\emph default
): A clique c from a graph G is called maximal clique if any superset of
nodes that contains the cliques c is not a clique.
\layout Standard
Definition:(
\series bold
\emph on
separator
\series default
\emph default
) Let
\begin_inset Formula $c_{1}$
\end_inset
and
\begin_inset Formula $c_{2}$
\end_inset
two cliques from a graph G.
The separator of these two cliques
\begin_inset Formula $c_{1}$
\end_inset
and
\begin_inset Formula $c_{2}$
\end_inset
is
\begin_inset Formula $c_{1}\cap c_{2}$
\end_inset
.
\layout Standard
\series bold
Remark
\series default
: It is obvious that the sequence graphs that we are dealing with are chordal
and the application of the theorem will give us our previously derived
formula i.e., equation (
\begin_inset LatexCommand \ref{eq:nbkp}
\end_inset
).
\layout Standard
\emph on
Proof Idea
\emph default
.
We will give a proof of the theorem for the sequence graphs that we are
dealing with and refer the reader to
\begin_inset LatexCommand \cite{cowell99probabilistic}
\end_inset
for the proof in the case of general chordal graphs.
\layout Standard
Intuitive example: For our sequence of 5 elements the probability of a sequence
is as follows:
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\frac{\prod_{i=1}^{4}P(S_{i}=s_{i},S_{i+1}=s_{i+1})}{\prod_{i=2}^{4}P(S_{i}=s_{i})}\]
\end_inset
\layout Standard
\begin_inset Formula \[
=\frac{P(S_{1}=s_{1},S_{2}=s_{2})P(S_{2}=s_{2},S_{3}=s_{3})P(S_{3}=s_{3},S_{4}=s_{4})P(S_{4}=s_{4},S_{5}=s_{5})}{P(S_{2}=s_{2})P(S_{3}=s_{3})P(S_{4}=s_{4})}\]
\end_inset
\layout Standard
where if we write
\begin_inset Formula $P(S_{1}=s_{1},S_{2}=s_{2})=P(S_{1}=s_{1})P(S_{2}=s_{2}|S_{1}=s_{1})$
\end_inset
and if we also group together the i+1-th term from the numerator with the
i-th term from the denominator into a conditional.
For example for the 2-nd term from numerator and the 1-st term from the
denominator we have:
\begin_inset Formula $P(S_{2}=s_{2},S_{3}=s_{3})/P(S_{2}=s_{2})=P(S_{3}=s_{3}|S_{2}=s_{2})$
\end_inset
we get:
\layout Standard
\begin_inset Formula \[
P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])=\]
\end_inset
\layout Standard
\begin_inset Formula \[
P(S_{1}=s_{1})P(S_{2}=s_{2}|S_{1}=s_{1})P(S_{3}=s_{3}|S_{2}=s_{2})P(S_{4}=s_{4}|S_{3}=s_{3})P(S_{5}=s_{5}|S_{4}=s_{4})\]
\end_inset
\layout Standard
which is
\begin_inset Formula $P(\overline{S}=[S_{1}=s_{1},...,S_{5}=s_{5}])$
\end_inset
given the independences represented by the graphical model.
\layout Standard
Similarly we can show that for k-wise dependencies:
\layout Standard
\begin_inset Formula \begin{equation}
P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])\;=\frac{\prod_{i=1}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})}{\prod_{i=2}^{n-k+1}P(S_{i}=s_{i},...,S_{i+k-2}=s_{i+k-2})}\label{eq:pnbk}\end{equation}
\end_inset
\layout Standard
\begin_inset Formula \begin{equation}
=P(S_{1}=s_{1},...,S_{k-1}=s_{k-1})\prod_{i=k}^{n}P(S_{i}=s_{i}|S_{i-1}=s_{i-1},...,S_{i-k+1}=s_{i-k+1})\label{eq:pmmk}\end{equation}
\end_inset
\layout Standard
Which proves the theorem for sequence graphs.
\begin_inset Formula $\Box$
\end_inset
\layout Standard
Note that the previous formula is identical to the probability that a Markov(k-1
) model will assign.
Which brings us to the next Section.
\layout Subsubsection
Directed Graphical Model Interpretation of NB(k) and the Random Fields vs.
Markov models debate.
\layout Standard
If we are to construct a directed graphical model instead of an undirected
one we will get, for our sequence of length 5, the graphs depicted in Figure
\begin_inset LatexCommand \ref{cap:Bayesian-Network-(Markov}
\end_inset
a) b) c) for independence, pairwise and 3-wise dependency respectively.
These are graphs representing the Markov(k-1) models in the general case
(see
\begin_inset LatexCommand \cite{charniak93statistical,ewens01statistical}
\end_inset
).
(Note that the Markov(k-1) models dependency among k elements: the previous
k-1 and current one, hence the correspondence with NB(k) and not with NB(k-1))
\layout Standard
The probability distributions represented by the undirected graphical models
are the same as the one of the directed one as mentioned in the Proof Idea
of the Junction Tree Theorem.
More exactly we have shown that
\begin_inset Formula $P(\overline{S}=[S_{1}=s_{1},...,S_{n}=s_{n}])$
\end_inset
can be written either as in equation (
\begin_inset LatexCommand \ref{eq:pnbk}
\end_inset
) which is the formula for the undirected model or as in equation (
\begin_inset LatexCommand \ref{eq:pmmk}
\end_inset
) which is the formula for the directed / Bayesian network / Markov(k-1)
model.
\layout Standard
\begin_inset Float figure
wide false
collapsed false
\layout Standard
\begin_inset Graphics
filename sNBkBN.eps
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Bayesian-Network-(Markov}
\end_inset
Bayesian Network (Markov Model) representation of the k-wise dependency.
a) shows the Naive Bayes = independence model; b) shows pairwise dependence
and c) shows 3-wise dependence
\end_inset
\layout Standard
We may think that Markov Models, since they are time oriented probabilistic
models, might be less suited for modelling sequences that do not have this
temporal dimension but where the dimension along which the sequence spans
of is rather spatial than temporal nature (as in the case of protein sequences
vs.
speech utterances for example).
As a consequence a need for using undirected graphical models may be felt.
We have proved however [see Proof Idea] that the Markov(k-1) Models (temporally
oriented) and the Markov Networks / Random Fields (undirected - spatially
based) / NB(k) models produce the same probability distribution.
Hence, which type of model to choose becomes more of a matter of preference
or taste rather than one of choosing between models that yield different
probabilities.
\layout Subsection
Training, i.e., Probability Estimation from Data
\begin_inset LatexCommand \label{sub:Training}
\end_inset
\layout Standard
In order to train our models we need to get estimates from the probabilities
\begin_inset Formula $P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1})$
\end_inset
for each possible class value
\begin_inset Formula $c_{j}$
\end_inset
.
Because of the stationarity assumption these probabilities are going to
be the same for each
\begin_inset Formula $i$
\end_inset
so
\begin_inset Formula $P(S_{1}=s_{1},...,S_{k}=s_{k}|c_{j})=P(S_{i}=s_{i},...,S_{i+k-1}=s_{i+k-1}|c_{j})$
\end_inset
for all values of
\begin_inset Formula $i$
\end_inset
.
We will denote their common value for a set of elements
\begin_inset Formula $s_{1},...,s_{k}$
\end_inset
and a class
\begin_inset Formula $c_{j}$
\end_inset
by notation abuse with
\begin_inset Formula $P(s_{1},...,s_{k}|c_{j})$
\end_inset
.
We will estimate this quantity in the standard way as follows:
\begin_inset Formula \[
P(s_{1},...,s_{k}|c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(c_{j})}\]
\end_inset
Where
\begin_inset Formula $counts(s_{1},...,s_{k},c_{j})$
\end_inset
are the number of times the elements
\begin_inset Formula $s_{1},...,s_{k}$
\end_inset
appear in a sequence of belonging to class
\begin_inset Formula $c_{j}$
\end_inset
at any position and
\begin_inset Formula $counts(c_{j})$
\end_inset
is how many such k-grams exist within sequences from class
\begin_inset Formula $c_{j}$
\end_inset
.
\layout Standard
In the case of directed models the estimation can be done in a similar way
namely by the following formula:
\layout Standard
\begin_inset Formula \[
P(s_{1}|s_{2},...,s_{k},c_{j})=\frac{counts(s_{1},...,s_{k},c_{j})}{counts(s_{2},...,s_{k},c_{j})}\]
\end_inset
\layout Standard
Now that we have showed how to fit the parameters of the model we are ready
to test it experimentally.
\layout Standard
Furthermore in order to avoid zero probabilities and to get a mild regularisatio
n effect we used the Laplace correction for the previous probability estimates
(Which basically initialises the counts with 1 instead of 0 and adds to
the denominator an appropriate count in order to ensure that the probabilities
sum up to 1) .
\layout Section
Experimental setup and results
\layout Standard
\begin_inset Note
collapsed true
\layout Standard
We will test the two classification algorithms associated with the probabilistic
models derived in Sections
\begin_inset LatexCommand \ref{sub:UNPkgP-&-NB}
\end_inset
and
\begin_inset LatexCommand \ref{sub:NB(k)}
\end_inset
against Naive Bayes.
More precisely we will test Naive Bayes, NB k-grams and NB(k) on three
protein function classification datasets.
\layout Standard
\series bold
Datasets:
\series default
The three Datasets are as follows: The first one Kinase, contains 5 classes
of Kinases amounting to a total of 396 proteins with the following per
class count (102, 209, 69, 10, 6); The second one Kinase/Isomerase aims
to discriminate between Kinases and Ligases (i.e., a two class problem) and
contains a total of 376 protein sequences with a per class count of (158,
218); The third dataset Kinase/Ligase/Helicase/Isomerase contains four
classes of proteins: Kinases, Ligases, Helicases and Isomerases respectively,
amounting to 572 proteins with a per class count of (158, 218, 110, 86).
All the protein functions have been Extracted from GO (the Gene Ontology,
\begin_inset LatexCommand \cite{gene00gene}
\end_inset
) and the corresponding protein sequences were extracted from SWISSPROT
\begin_inset LatexCommand \cite{bairoch00swiss-prot}
\end_inset
.
The datasets are available for download at
\begin_inset LatexCommand \cite{andorf04pcdatasets}
\end_inset
.
\layout Standard
\series bold
Experiments and Results:
\series default
The computational experiments were motivated by the following questions:
\layout Standard
How do the two presented methods NB k-grams and NB(k) compare among each
other and against the baseline represented by Naive Bayes(i.e., NB(1), a.k.a.
NB 1-grams)?
\layout Standard
Furthermore we inquire upon the effect that the number of dependencies,
k, has on the classification accuracy of the dependency ?
\layout Standard
We ran each algorithm with the size of dependency k sizes ranging from 1
to 4 (except for Naive Bayes, of course).
For sizes of k larger than 4, e.g., for 5 we need to calculate over three
million probabilities (=
\begin_inset Formula $20^{5}$
\end_inset
) per class the size of the data doe not allow us to reliably estimate these
probabilities.
The accuracy estimates in all the tables are based on stratified 10-fold
cross validation.
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\SpecialChar ~
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB k-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
81.3
\end_inset
|
\begin_inset Text
\layout Standard
88.6
\end_inset
|
\begin_inset Text
\layout Standard
7.3
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
89.9
\end_inset
|
\begin_inset Text
\layout Standard
92.7
\end_inset
|
\begin_inset Text
\layout Standard
2.8
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
90.4
\end_inset
|
\begin_inset Text
\layout Standard
91.6
\end_inset
|
\begin_inset Text
\layout Standard
1.6
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Kinase}
\end_inset
Kinase dataset results.
\end_inset
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\SpecialChar ~
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB on k-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
83.5
\end_inset
|
\begin_inset Text
\layout Standard
84.6
\end_inset
|
\begin_inset Text
\layout Standard
1.1
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
84.0
\end_inset
|
\begin_inset Text
\layout Standard
85.6
\end_inset
|
\begin_inset Text
\layout Standard
1.6
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
85.9
\end_inset
|
\begin_inset Text
\layout Standard
90.7
\end_inset
|
\begin_inset Text
\layout Standard
4.8
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:K_L}
\end_inset
Kinase / Ligase dataset results.
\end_inset
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\series bold
\SpecialChar ~
\series default
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB on [k+1]-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
70.3
\end_inset
|
\begin_inset Text
\layout Standard
72.2
\end_inset
|
\begin_inset Text
\layout Standard
1.9
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
79.5
\end_inset
|
\begin_inset Text
\layout Standard
80.8
\end_inset
|
\begin_inset Text
\layout Standard
1.3
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
79.4
\end_inset
|
\begin_inset Text
\layout Standard
82.0
\end_inset
|
\begin_inset Text
\layout Standard
2.6
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:K_L_H_I}
\end_inset
Kinase / Ligase / Helicase / Isomerase dataset results.
\end_inset
\layout Standard
The experimental results show the superiority of both NB k-grams and NB(k)
over Naive Bayes on all test cases.
Furthermore NB(k) ended up being superior in accuracy to NB k-grams on
all test case.
In the case of the Kinase dataset for NB(4) while still doing better than
NB 4-grams did worse that NB(3).
This result is probably due to an insufficient amount of data available
for the estimation of the probabilities.
\end_inset
\layout Standard
We will test the two classification algorithms associated with the probabilistic
models derived in Sections
\begin_inset LatexCommand \ref{sub:UNPkgP-&-NB}
\end_inset
and
\begin_inset LatexCommand \ref{sub:NB(k)}
\end_inset
against Naive Bayes.
More precisely we will test Naive Bayes, NB k-grams and NB(k) on three
protein function classification datasets.
\layout Subsection
Data Sets
\layout Standard
The first dataset is based on families of yeast and human kinases.
These families were chosen for this study because many of them are well-charact
erized, with known structures and functions.
The data set used in this study consisted of proteins belonging to the
Gene Ontology, functional family GO0004672, Protein Kinase Activity.
We classified them according to the highest level below GO0004672.
This consists of 5 groups.
In GO their labels are GO0004672, Protein Kinase Activity; GO0004674, protein
serine/threonine kinase activity; GO0004713, protein-tyrosine kinase activity;
GO0004712, protein threonine/tyrosine kinase activity; and GO0004716, receptor
signaling protein tyrosine kinase activity.
These families represent all the yeast and human proteins that can be extracted
from SWISSPROT that have the given GO classification.
Since GO is represented as a directed acyclic graph, some proteins may
be represented in multiple classifications.
We filter the data to remove any multi-labeled proteins to guarantee disjoint
classes.
This dataset includes a total of 396 proteins: 102 proteins from GO0004672;
209 proteins from GO0004674, 69 proteins from GO0004713, 10 proteins from
GO0004712, and 6 proteins from GO0004716.
\layout Standard
The second dataset is based on two subfamilies of GO0003824, Catalytic Activity.
This division is at a higher level of GO then the previous dataset.
The data set used in this study consisted of proteins belonging to the
Gene Ontology functional family GO0004672, Protein Kinase Activity and
GO001684, Protein Ligase Activity.
These families represent all the yeast proteins that can be extracted from
SWISSPROT that have the given GO classification..
This data includes a total 376 proteins: 158 proteins from Kinase, and
218 proteins from Ligase.
\layout Standard
The third dataset is a super set of dataset two.
It contains the Kinase data and Ligase data in addition to two other subfamilie
s of GO0003824, Catalytic Activity.
These families are GO0004386, Protein Helicase Activity, and GO0016853,
Protein Isomerase Activity.
This dataset tests the classifiers ability on a larger number of classes
at a high level of GO.
This data includes a total of 572 proteins: 158 proteins from Kinase, 218
proteins from Ligase, 110 proteins from Helicase, and 86 proteins from
Isomerase.
\layout Standard
[
\series bold
Shorter version
\series default
]
\series bold
Datasets:
\series default
The three Datasets are as follows: The first one Kinase, contains 5 classes
of Kinases amounting to a total of 396 proteins with the following per
class count (102, 209, 69, 10, 6); The second one Kinase/Isomerase aims
to discriminate between Kinases and Ligases (i.e., a two class problem) and
contains a total of 376 protein sequences with a per class count of (158,
218); The third dataset Kinase/Ligase/Helicase/Isomerase contains four
classes of proteins: Kinases, Ligases, Helicases and Isomerases respectively,
amounting to 572 proteins with a per class count of (158, 218, 110, 86).
All the protein functions have been Extracted from GO (the Gene Ontology,
\begin_inset LatexCommand \cite{gene00gene}
\end_inset
) and the corresponding protein sequences were extracted from SWISSPROT
\begin_inset LatexCommand \cite{bairoch00swiss-prot}
\end_inset
.
The datasets are available for download at
\begin_inset LatexCommand \cite{andorf04pcdatasets}
\end_inset
.
\layout Subsection
Experiments and Results
\layout Standard
The computational experiments were motivated by the following questions:
\layout Standard
How do the two presented methods NB k-grams and NB(k) compare among each
other and against the baseline represented by Naive Bayes(i.e., NB(1), a.k.a.
NB 1-grams)?
\layout Standard
Furthermore we inquire upon the effect that the number of dependencies,
k, has on the classification accuracy of the dependency ?
\layout Standard
Tables
\begin_inset LatexCommand \ref{cap:Kinase}
\end_inset
,
\begin_inset LatexCommand \ref{cap:K_L}
\end_inset
and
\begin_inset LatexCommand \ref{cap:K_L_H_I}
\end_inset
show the results of our experiments using the three datasets.
For each dataset we ran the three separate algorithms Naive Bayes, (i.e.,
NB(1) a.k.a.
NB 1-grams), Naive Bayes with k-grams, and NB(k).
\layout Standard
We ran each algorithm with the size of dependency k sizes ranging from 1
to 4 (except for Naive Bayes, of course).
For sizes of k larger than 4, e.g., for 5 we need to calculate over three
million probabilities (=
\begin_inset Formula $20^{5}$
\end_inset
) per class; and we do not have enough data to reliably estimate these probabili
ties.
Therefore all of our data gets washed out by prior probabilities given
by the Laplace correction and everything is classified as belonging to
the largest class, the class with the highest probability of occurring.
For our alphabet we chose the standard 20 letter amino acid alphabet (although
using a smaller alphabet will decrease the complexity and therefore the
runtime of these algorithms, it offers no benefit in terms of accuracy,
neither of which is our ultimate goal, so we exclude these results
\begin_inset LatexCommand \cite{andorf02discovering}
\end_inset
.
Note that the reduced alphabets are obtained by grouping some amino acids
under the same symbol and treating the as one type of entity).
As already hinted all of our probability estimates for all methods used
Laplace correction.
The accuracy estimates in all the tables are based on stratified 10-fold
cross validation.
\layout Standard
The tables
\begin_inset LatexCommand \ref{cap:Kinase}
\end_inset
,
\begin_inset LatexCommand \ref{cap:K_L}
\end_inset
and
\begin_inset LatexCommand \ref{cap:K_L_H_I}
\end_inset
show from left to right: the dependency number k (k-wise dependency); the
accuracy for Naive Bayes, the accuracy for Naive Bayes k-grams, the accuracy
for NB(k).,and the improvement in accuracy of NB(k) over NB k-grams
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\SpecialChar ~
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB k-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
66.1
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
81.3
\end_inset
|
\begin_inset Text
\layout Standard
88.6
\end_inset
|
\begin_inset Text
\layout Standard
7.3
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
89.9
\end_inset
|
\begin_inset Text
\layout Standard
92.7
\end_inset
|
\begin_inset Text
\layout Standard
2.8
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
90.4
\end_inset
|
\begin_inset Text
\layout Standard
91.6
\end_inset
|
\begin_inset Text
\layout Standard
1.6
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:Kinase}
\end_inset
Kinase (GO0004672) dataset results.
There were a total of 396 proteins and five classes.
k refers to the number of dependencies used in each algorithm.
NB shows the accuracy of running Naive Bayes on the dataset.
NB k-grams shows the accuracy of running Naive Bayes using k-grams on the
dataset.
NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies
on the dataset.
\end_inset
.
\layout Standard
[
\series bold
Can Be omitted in a short paper
\series default
] Table
\begin_inset LatexCommand \ref{cap:Kinase}
\end_inset
shows the results from the Kinase dataset.
Using Naive Bayes alone we were able to get a 66% classification rate.
By increasing our value of k to 2 we were able to increase accuracy to
81.3% for NB 2-grams and 88.6% with NB(2).
For NB(2) this was over a 22% improvement over Naive Bayes and over 7%
improvement over NB 2-grams.
By increasing k to 3, we again see an improvement.
NB 3-grams and NB(3) improved to 89.9% and 92.7% respectively.
When we increase k again to 4 we get little improvement on this dataset.
NB 4-grams improved by only 0.5% and NB(4) while still doing better than
NB 4-grams actually decreased in accuracy over in NB(3).
The main reason here is the poor estimates for the probability distribution.
We are estimating
\begin_inset Formula $20^{4}~160,000$
\end_inset
probabilities per class.
And we have 5 classes.
\layout Standard
Table
\begin_inset LatexCommand \ref{cap:K_L}
\end_inset
, Kinase/Ligase dataset and Table
\begin_inset LatexCommand \ref{cap:K_L_H_I}
\end_inset
, Kinase/Ligase/Isomerase/Helicase dataset show similar results.
When increasing k from 2 to 3 we get a dramatic improvement with NB(3)
over NB(2), over 5% in Table
\begin_inset LatexCommand \ref{cap:K_L}
\end_inset
and nearly 15% in Table
\begin_inset LatexCommand \ref{cap:K_L_H_I}
\end_inset
.
As we increase k from 2 to 4 we again see improvement in both datasets.
We can get up to a 90.7% accuracy in Table
\begin_inset LatexCommand \ref{cap:K_L}
\end_inset
and 82% accuracy in Table
\begin_inset LatexCommand \ref{cap:K_L_H_I}
\end_inset
.
Another important observation is that NB(k) always outperformed NB k-grams.
It is worth noting that both of these experiments used the same splits
when doing cross validation and used the same way (Laplace correction)
to estimate the probabilities when building classifiers.
So there is no biased in terms of how these two algorithms were trained.
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\SpecialChar ~
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB k-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
77.9
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
83.5
\end_inset
|
\begin_inset Text
\layout Standard
84.6
\end_inset
|
\begin_inset Text
\layout Standard
1.1
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
84.0
\end_inset
|
\begin_inset Text
\layout Standard
85.6
\end_inset
|
\begin_inset Text
\layout Standard
1.6
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
85.9
\end_inset
|
\begin_inset Text
\layout Standard
90.7
\end_inset
|
\begin_inset Text
\layout Standard
4.8
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:K_L}
\end_inset
Kinase(GO0004672) / Ligase(GO001684) dataset results.
There were a total of 396 proteins and two classes.
K refers to the number of dependencies used in each algorithm.
NB shows the accuracy of running Naive Bayes on the dataset.
NB k-grams shows the accuracy of running Naive Bayes using k-grams on the
dataset.
NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies
on the dataset.
\end_inset
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\series bold
\SpecialChar ~
\series default
k
\end_inset
|
\begin_inset Text
\layout Standard
NB
\end_inset
|
\begin_inset Text
\layout Standard
NB k-grams
\end_inset
|
\begin_inset Text
\layout Standard
NB(k) [<=>MM(k-1)]
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
56.1
\end_inset
|
\begin_inset Text
\layout Standard
0.0
\end_inset
|
|
\begin_inset Text
\layout Standard
2
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
70.3
\end_inset
|
\begin_inset Text
\layout Standard
72.2
\end_inset
|
\begin_inset Text
\layout Standard
1.9
\end_inset
|
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
79.5
\end_inset
|
\begin_inset Text
\layout Standard
80.8
\end_inset
|
\begin_inset Text
\layout Standard
1.3
\end_inset
|
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
X
\end_inset
|
\begin_inset Text
\layout Standard
79.4
\end_inset
|
\begin_inset Text
\layout Standard
82.0
\end_inset
|
\begin_inset Text
\layout Standard
2.6
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:K_L_H_I}
\end_inset
Kinase(GO0004672) / Ligase(GO001684) / Helicase Activity(GO0004386) / Isomerase
(GO0016853) dataset results.
There was a total of 572 proteins and four classes.
K refers to the number of dependencies used in each algorithm.
NB shows the accuracy of running Naive Bayes on the dataset.
NB k-grams shows the accuracy of running Naive Bayes using k-grams on the
dataset.
NB(k) [<=>MM(k-1)] shows the accuracy of running NB(k) using k dependencies
on the dataset.
\end_inset
\layout Standard
[To recapitulate] The experimental results show the superiority of both
NB k-grams and NB(k) over Naive Bayes on all test cases.
Furthermore NB(k) ended up being superior in accuracy to NB k-grams on
all test case.
In the case of the Kinase dataset for NB(4) while still doing better than
NB 4-grams did worse that NB(3).
This result is probably due to an insufficient amount of data available
for the estimation of the probabilities.
\layout Standard
Table
\begin_inset LatexCommand \ref{cap:best-compared}
\end_inset
reports the best k size for each of the two methods.
In this table model size refers to the overall alphabet size used by the
algorithm or equivalently how many probabilities per class we estimate.
It is computed by taking the alphabet size and raising it to the k-th power.
For example an alphabet size of 20, our amino acid alphabet, with a k size
of 4 would have a
\begin_inset Formula $20^{4}$
\end_inset
= 160,000 letter alphabet.
This is also, again the number of probabilities per class needed to be
estimated for each algorithm.
Note that in the case of Naive Bayes we estimate 20 probabilities per class.
Also note that the estimations of these big probabilities are very fast
as long as the table fits into memory (it involves one addition and one
multiplication per entry into the table).
Table
\begin_inset LatexCommand \ref{cap:best-compared}
\end_inset
shows that k equal to 3 or 4 is usually optimal in terms of classification.
\layout Standard
\begin_inset Float table
wide false
collapsed false
\layout Standard
\begin_inset Tabular
|
\begin_inset Text
\layout Standard
\end_inset
|
\begin_inset Text
\layout Standard
NB k-grams
\end_inset
|
\begin_inset Text
\layout Standard
\end_inset
|
\begin_inset Text
\layout Standard
\end_inset
|
\begin_inset Text
\layout Standard
NB(k)
\end_inset
|
\begin_inset Text
\layout Standard
\end_inset
|
\begin_inset Text
\layout Standard
\end_inset
|
\begin_inset Text
\layout Standard
\end_inset
|
|
\begin_inset Text
\layout Standard
DataSet
\end_inset
|
\begin_inset Text
\layout Standard
k
\end_inset
|
\begin_inset Text
\layout Standard
Model Size
\end_inset
|
\begin_inset Text
\layout Standard
Acc.
\end_inset
|
\begin_inset Text
\layout Standard
k
\end_inset
|
\begin_inset Text
\layout Standard
Model Size
\end_inset
|
\begin_inset Text
\layout Standard
Acc.
\end_inset
|
\begin_inset Text
\layout Standard
Improvement
\end_inset
|
|
\begin_inset Text
\layout Standard
Kinase
\end_inset
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
160,000
\end_inset
|
\begin_inset Text
\layout Standard
90.4
\end_inset
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
8000
\end_inset
|
\begin_inset Text
\layout Standard
92.7
\end_inset
|
\begin_inset Text
\layout Standard
2.3
\end_inset
|
|
\begin_inset Text
\layout Standard
K/L
\end_inset
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
160,000
\end_inset
|
\begin_inset Text
\layout Standard
85.9
\end_inset
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
160,000
\end_inset
|
\begin_inset Text
\layout Standard
90.7
\end_inset
|
\begin_inset Text
\layout Standard
4.8
\end_inset
|
|
\begin_inset Text
\layout Standard
K/L/H/I
\end_inset
|
\begin_inset Text
\layout Standard
3
\end_inset
|
\begin_inset Text
\layout Standard
8000
\end_inset
|
\begin_inset Text
\layout Standard
79.5
\end_inset
|
\begin_inset Text
\layout Standard
4
\end_inset
|
\begin_inset Text
\layout Standard
160,000
\end_inset
|
\begin_inset Text
\layout Standard
82.0
\end_inset
|
\begin_inset Text
\layout Standard
2.5
\end_inset
|
\end_inset
\layout Caption
\begin_inset LatexCommand \label{cap:best-compared}
\end_inset
A comparison of the best accuracy for each dataset using both the Naive
Bayes with k-grams and NB(k) methods.
k refers to the number of dependencies modelled by each algorithm.
Model size refers to the total number of probabilities per class computed
by each algorithm for the given value of k.
Acc.
refers to the accuracy.
For each algorithm we show the value k for which it obtained the best accuracy
on each dataset.
Improvement shows the percent improvement that NB(k) had over NB with k-grams.
\end_inset
\layout Section
Summary and Discussion
\layout Standard
To recapitulate, we presented two methods that relax the independence assumption
of Naive Bayes by modelling interaction among close neighbours in the sequence.
The two methods model k-wise dependency among successive elements in sequence
data.
Despite the modeling accuracy gain, the two proposed algorithms maintain
the
\begin_inset Quotes eld
\end_inset
one pass through the data only
\begin_inset Quotes erd
\end_inset
training property of Naive Bayes.
\layout Standard
We put to test our methods on three protein function prediction problems
with functional classes extracted from GO (the Gene Ontology,
\begin_inset LatexCommand \cite{gene00gene}
\end_inset
and corresponding sequences extracted from SWISSPROT
\begin_inset LatexCommand \cite{bairoch00swiss-prot}
\end_inset
.
The proposed methods compared favorably with Naive Bayes.
Furthermore NB(k) showed (as expected) the best performance in terms of
accuracy on all test cases (over NB k-grams).
NB(k) is a natural generalization of Naive Bayes (for k=1) and it constitutes
and end in itself.
That is, under the stationarity assumption it is the most complete and
correct probabilistic model of k-wise interaction among k successive elements
of the sequence.
\layout Standard
The improved modelling accuracy and also the demonstrated improvement in
test accuracy for increased values of k need to be taken with a grain of
salt.
That is they will hold, under the proviso that we have enough data to fit
the model probabilities.
\layout Standard
We have also shown the equivalence of the NB(k) undirected model with the
Markov Model of order k-1.
This contributes to dispelling a misconception that undirected models (i.e.,
random fields) are more suitable than directed models such as Markov Models
for modelling protein sequences since the order in the protein sequence
has a spatial nature rather than a temporal one as it is the case with
Markov Models.
While we do not dispute the spatial nature of protein sequences, we show
that both the undirected (spatial = NB(k) ) and the directed models (temporal
= Markov(k-1) Models) associated with with k-wise dependency yield the
same probability distribution.
Additionally NB(k), which is not incidentally also our best performer,
represents and end in itself since it is the most complete undirected graphical
model for k-grams modelling.
\layout Standard
Some directions for future work include: Studying the effect of varying
the alphabet size on the resulting accuracy.More in-depth examination of
the results, addressing questions such as: Are NB(k) picking out probabilities
similar to those of multiple sequence alignment motifs? A more in-depth
study using a wider range of datasets such as Intrusion Detection, ...
.
\layout Standard
\begin_inset LatexCommand \BibTeX[plain]{D:/Users/docs/lyx/_bibliography/bib}
\end_inset
\the_end