A Brief Overview of Proposed Work

(Neeraj  Koul)

 

 

Clustering

 

In cluster analysis, one wishes to partition entities into groups based on given features of each entity, so that the groups are homogeneous and well separated. Clustering is one of the most widely used methods for deriving meaningful information from gene expression data.

           

             Hierarchical Clustering.

 The hierarchical clustering algorithm used is based closely on the average-linkage method of Sokal and Michener [1], which was developed for clustering correlation matrices such as those used here. The object of this algorithm is to compute a dendrogram that assembles all elements into a single tree. For any set of n genes, an upper-diagonal similarity matrix, containing similarity scores for all pairs of genes, is computed using a similarity metric (distance measure). The matrix is scanned to identify the highest value (representing the most similar pair of genes). A node is created joining these two genes, and a gene expression profile is computed for the node by averaging the observations for the joined elements (missing values are omitted, and the two joined elements are weighted by the number of genes they contain). The similarity matrix is updated with this new node replacing the two joined elements, and the process is repeated n-1 times until only a single element remains.
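The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [1] or [3]: the function names `pearson` and `average_linkage` are hypothetical, missing values are assumed to be absent, and the similarity of every pair is recomputed on each pass instead of updating a stored matrix.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (no missing values)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def average_linkage(profiles):
    """Return the merges as tuples (left_id, right_id, similarity, new_id)."""
    # Each active node maps id -> (profile, number of genes it contains).
    nodes = {i: (list(p), 1) for i, p in enumerate(profiles)}
    next_id = len(profiles)
    merges = []
    while len(nodes) > 1:
        # Scan all pairs of active nodes for the highest similarity score.
        a, b = max(
            combinations(sorted(nodes), 2),
            key=lambda pair: pearson(nodes[pair[0]][0], nodes[pair[1]][0]),
        )
        (pa, wa), (pb, wb) = nodes.pop(a), nodes.pop(b)
        score = pearson(pa, pb)
        # New profile: average of the joined profiles, weighted by gene counts.
        merged = [(wa * u + wb * v) / (wa + wb) for u, v in zip(pa, pb)]
        nodes[next_id] = (merged, wa + wb)
        merges.append((a, b, score, next_id))
        next_id += 1
    return merges

# Three toy genes over four conditions; n genes always give n-1 merges.
toy = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1]]
for left, right, score, new in average_linkage(toy):
    print(f"join {left} and {right} (similarity {score:.3f}) -> node {new}")
```

The list of merges is enough to draw the dendrogram: each tuple names the two children of a new internal node and the similarity at which they were joined.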

 

The two basic concepts involved in hierarchical clustering are the similarity metric and the joining of the two most similar nodes to create a new expression profile.

 

    Metrics

 

 The metric is the function used to compute the distance (or similarity) between two nodes. Standard metrics from statistics can be used for gene expression analysis, provided that they are able to capture both positive and negative correlations. Any of the following metrics may be used:

 

1)      Pearson’s correlation coefficient

2)      Chi-square

3)      Mutual information

 

All the above metrics capture both negative and positive correlations. This is in contrast to metrics such as Euclidean distance, which treat only positively correlated profiles as similar.
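A small numerical illustration of this contrast, assuming two toy profiles that are perfectly anti-correlated. The helper names `pearson` and `euclidean` are hypothetical and used only for this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

up = [1.0, 2.0, 3.0, 4.0]    # expression rises across conditions
down = [4.0, 3.0, 2.0, 1.0]  # expression falls across conditions

# The correlation is strongly negative, so the relationship is detected.
print("Pearson:", pearson(up, down))
# The Euclidean distance is simply large, as for any two unrelated profiles.
print("Euclidean:", euclidean(up, down))
```

A correlation close to -1 flags the pair as related with opposite sign, whereas the Euclidean distance alone cannot distinguish this pair from two unrelated profiles that happen to lie far apart.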

 

 For the purpose of my experiment I plan to use Pearson’s correlation coefficient. Using this metric, for any two genes X and Y observed over a series of N conditions, a similarity score can be computed as follows:

$$S(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{\mathrm{offset}}}{\Phi_X} \right) \left( \frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y} \right)$$

 

where

$$\Phi_G = \sqrt{\sum_{i=1}^{N} \frac{(G_i - G_{\mathrm{offset}})^2}{N}}$$

 

Here G_i equals the (log-transformed) primary data for gene G in condition i. When G_offset is set to the mean of the observations on G, Φ_G is the standard deviation of G and S(X, Y) is exactly equal to the Pearson correlation coefficient of the observations of X and Y. Instead of setting G_offset to the mean of the observations, I propose to set it to the boundary value corresponding to the state at which the gene switches from ON to OFF or vice versa.
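The similarity score with a configurable offset can be sketched as follows. This is a minimal sketch: the function name `similarity` and its keyword arguments are hypothetical, the inputs are assumed to be already log-transformed and free of missing values, and how the ON/OFF boundary value is obtained is left open, as it is in the text.

```python
import math

def similarity(x, y, x_offset=None, y_offset=None):
    """S(X, Y) for two equal-length profiles.

    If an offset is None, the mean of that profile is used, which gives
    the ordinary Pearson correlation coefficient. Passing an explicit
    offset (for example an assumed ON/OFF boundary value) gives the
    modified score proposed in the text.
    """
    n = len(x)
    if x_offset is None:
        x_offset = sum(x) / n
    if y_offset is None:
        y_offset = sum(y) / n
    phi_x = math.sqrt(sum((a - x_offset) ** 2 for a in x) / n)
    phi_y = math.sqrt(sum((b - y_offset) ** 2 for b in y) / n)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("profile is constant at its offset; S is undefined")
    total = sum((a - x_offset) * (b - y_offset) for a, b in zip(x, y))
    return total / (n * phi_x * phi_y)

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
print(similarity(x, y))            # mean offsets: ordinary Pearson correlation
print(similarity(x, y, 0.0, 0.0))  # offsets fixed at 0 (illustrative value only)
```

With the offsets fixed at 0 the two profiles above score positive (both are always above the offset), whereas the mean-centred score is negative. This shows how strongly the choice of offset can change which genes are considered similar.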

 

 

Joining Two Nodes to Create a New Expression Profile

 

 We believe that in biological systems it is not appropriate to create the new expression profile for a node by averaging the observations of the two genes (or nodes) that have been clustered. The motivation is that, in biological systems, a group of genes acts together as a system to control or affect a third group.

 

      Let us consider an ON-OFF model of the genes. According to this model, a gene can be in one of two states: turned “ON” (active) or turned “OFF” (not active).

Let us consider two systems (clusters) Ci and Cj. The various possibilities are enumerated below.

 

 Ci         Cj         Ci,Cj

 OFF        OFF        OFF

 OFF        ON         ON

 ON         OFF        ON

 ON         ON         ON

 

 

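Here Ci,Cj represents the system formed by taking Ci and Cj together. If we represent OFF by binary zero and ON by binary one, we have

Ci,Cj = -Ci * Cj + Ci

where * denotes AND, + denotes OR and - denotes NOT, so that the expression is evaluated as ((-Ci) * Cj) + Ci.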
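The formula can be checked against the table by enumerating the four binary cases. The sketch below reads * as AND, + as OR and - as NOT (the reading implied by the fuzzy definitions that follow); the function name `join_boolean` is hypothetical.

```python
from itertools import product

def join_boolean(ci, cj):
    """Evaluate ((-Ci) * Cj) + Ci for binary inputs 0 (OFF) and 1 (ON)."""
    not_ci = 1 - ci             # -Ci  : NOT
    return (not_ci & cj) | ci   # *    : AND,  + : OR

# Build the full truth table for the combined system Ci,Cj.
table = {(ci, cj): join_boolean(ci, cj) for ci, cj in product((0, 1), repeat=2)}

state = {0: "OFF", 1: "ON"}
for (ci, cj), out in sorted(table.items()):
    print(state[ci], state[cj], state[out])
```

For binary inputs the result coincides with a plain OR of Ci and Cj: the combined system is ON whenever at least one of the two clusters is ON.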

 

 

To apply this formula to non-discrete data, I plan to use the fuzzy definitions of AND, OR and NOT stated below. Here we assume that all values lie in the range [0,1].

    A * B = Minimum(A, B)

    A + B = Maximum(A, B)

    -A = 1 - A
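A minimal sketch of the joining rule under these fuzzy definitions, assuming the expression values have already been rescaled to [0,1]. The names `fuzzy_join` and `join_profiles` are hypothetical.

```python
def fuzzy_join(ci, cj):
    """Fuzzy version of ((-Ci) * Cj) + Ci for values in [0, 1]."""
    if not (0.0 <= ci <= 1.0 and 0.0 <= cj <= 1.0):
        raise ValueError("inputs must lie in [0, 1]")
    not_ci = 1.0 - ci                # -A    = 1 - A
    and_term = min(not_ci, cj)       # A * B = Minimum(A, B)
    return max(and_term, ci)         # A + B = Maximum(A, B)

def join_profiles(profile_i, profile_j):
    """Join two equal-length profiles condition by condition."""
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must have the same length")
    return [fuzzy_join(a, b) for a, b in zip(profile_i, profile_j)]

# At the binary extremes the rule reproduces the ON-OFF table.
print([fuzzy_join(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])

# Intermediate values give a graded result for each condition.
print(join_profiles([0.4, 0.9, 0.1], [0.9, 0.2, 0.1]))
```

For intermediate values the expression is not symmetric in Ci and Cj and need not equal Maximum(Ci, Cj). For example, Ci = 0.4 and Cj = 0.9 give 0.6, so the order in which two nodes are joined can affect the new profile.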

 

 

 

 

Simulations Proposed

 

            As of now, I propose to apply the above methodology to the following data sets:

1)      Toy data

2)      Non-biological data from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)

                          Breast Cancer Data Set

                         CPU Performance Data Set

                         Automobile Performance Data Set

 

3)      Biological data

                           Diauxic shift (from DeRisi et al. [2]), http://cmgm.stanford.edu/pbrown/explore/array.txt

                           Cell cycle (Stanford), http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt

 

                

References

 

[1] Sokal, R. R. & Michener, C. D. (1958) Univ. Kans. Sci. Bull. 38, 1409-1438

[2] DeRisi, Iyer, Brown. Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale.

          http://www.sciencemag.org/cgi/content/abstract/278/5338/680?ijkey=Yy3W4BUHKd9rA

[3] Eisen, Spellman, Brown, Botstein. Cluster analysis and display of genome-wide expression patterns.

          http://www.pnas.org/cgi/content/full/95/25/14863#B5

[4] http://rana.lbl.gov/