(Neeraj Koul)
In cluster analysis, one wishes to partition entities into groups based on given features of each entity, so that the groups are homogeneous and well separated. Clustering is one of the most widely used methodologies for deriving meaningful information from gene expression data.
Hierarchical Clustering.
The hierarchical clustering algorithm used is based closely on the average-linkage method of Sokal and Michener [1], which was developed for clustering correlation matrices such as those used here. The object of this algorithm is to compute a dendrogram that assembles all elements into a single tree. For any set of n genes, an upper-diagonal similarity matrix is computed by using a similarity metric (distance measure); the matrix contains similarity scores for all pairs of genes. The matrix is scanned to identify the highest value (representing the most similar pair of genes). A node is created joining these two genes, and a gene expression profile is computed for the node by averaging the observations for the joined elements (missing values are omitted, and the two joined elements are weighted by the number of genes they contain). The similarity matrix is updated with this new node replacing the two joined elements, and the process is repeated n-1 times until only a single element remains.
The two basic concepts involved in hierarchical clustering are the similarity metric and the joining of the two most similar nodes to create a new expression profile.
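The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names (pearson, weighted_average, average_linkage) and the toy profiles are hypothetical, missing values are assumed to be encoded as None, and every profile is assumed to vary across conditions (a constant profile would cause a division by zero).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation over the conditions observed in both profiles."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sx = sqrt(sum((a - mx) ** 2 for a, _ in pairs) / n)
    sy = sqrt(sum((b - my) ** 2 for _, b in pairs) / n)
    return sum((a - mx) * (b - my) for a, b in pairs) / (n * sx * sy)


def weighted_average(p, wp, q, wq):
    """Average two profiles, weighting each by the number of genes it holds.

    A value missing (None) in one profile is taken from the other profile.
    """
    out = []
    for a, b in zip(p, q):
        if a is None and b is None:
            out.append(None)
        elif a is None:
            out.append(b)
        elif b is None:
            out.append(a)
        else:
            out.append((wp * a + wq * b) / (wp + wq))
    return out


def average_linkage(profiles):
    """Join the most similar pair of nodes until a single tree remains."""
    # Each node is (subtree, expression profile, number of genes contained).
    nodes = [(name, list(p), 1) for name, p in profiles.items()]
    while len(nodes) > 1:
        # Scan all pairs for the highest similarity score.
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda ij: pearson(nodes[ij[0]][1], nodes[ij[1]][1]),
        )
        ti, pi, wi = nodes[i]
        tj, pj, wj = nodes[j]
        merged = ((ti, tj), weighted_average(pi, wi, pj, wj), wi + wj)
        # Replace the two joined elements with the new node.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


# Hypothetical toy profiles: g1/g2 rise together, g3/g4 fall together.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
tree = average_linkage(genes)
print(tree)
```

The returned nested tuple plays the role of the dendrogram: on the toy profiles, g1 and g2 end up in one subtree and g3 and g4 in the other. The sketch recomputes all pairwise scores on every pass for clarity; a practical version would update the similarity matrix incrementally.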
Metrics
A metric is a function that lets us compute the distance (or similarity) between two nodes. Standard metrics from statistics can be used efficiently for gene expression analysis. However, they should be able to capture both positive and negative correlations. Any of the following metrics may be used effectively:
1) Pearson’s Correlation Coefficient
2) Chi Square
3) Mutual Information
All the above metrics capture both negative and positive correlations. This is in contrast to metrics such as Euclidean distance, which capture only positive correlations.
For the purpose of my experiment I plan to use Pearson’s correlation coefficient. Using this metric, for any two genes X and Y observed over a series of N conditions, a similarity score can be computed as follows:
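$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\mathrm{offset}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y}\right)$$

where

$$\Phi_G = \sqrt{\sum_{i=1}^{N}\frac{\left(G_i - G_{\mathrm{offset}}\right)^2}{N}}$$

Here G_i equals the (log-transformed) primary data for gene G in condition i. When G_offset is set to the mean of the observations on G, Φ_G is the standard deviation of G, and S(X,Y) is exactly equal to the Pearson correlation coefficient of the observations of X and Y. Instead of setting G_offset to the mean of the observations, I propose to set it to the boundary value that corresponds to the state at which the gene is turned from ON to OFF and vice versa.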
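To make the role of the offset concrete, here is a minimal sketch of a similarity score with a configurable offset. It assumes complete profiles (no missing values) and profiles that are not constant at the offset; the function name similarity and its parameters x_offset and y_offset are illustrative, not taken from any library. When an offset is left unset, the mean is used, which gives the ordinary Pearson coefficient.

```python
from math import sqrt


def similarity(x, y, x_offset=None, y_offset=None):
    """Similarity score S(X, Y) with a configurable offset per gene.

    S = (1/N) * sum_i ((x_i - x_offset) / phi_x) * ((y_i - y_offset) / phi_y)
    where phi_g = sqrt(sum_i (g_i - g_offset)^2 / N).
    """
    n = len(x)
    if x_offset is None:
        x_offset = sum(x) / n  # mean offset: ordinary Pearson correlation
    if y_offset is None:
        y_offset = sum(y) / n
    phi_x = sqrt(sum((v - x_offset) ** 2 for v in x) / n)
    phi_y = sqrt(sum((v - y_offset) ** 2 for v in y) / n)
    total = sum((a - x_offset) * (b - y_offset) for a, b in zip(x, y))
    return total / (n * phi_x * phi_y)


# Hypothetical profiles: y falls as x rises, but both stay above 0.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]

mean_centred = similarity(x, y)            # Pearson: close to -1
boundary_centred = similarity(x, y, 0, 0)  # offset 0 taken as ON/OFF boundary
print(mean_centred, boundary_centred)
```

With the mean as offset the two profiles are perfectly anti-correlated. With a hypothetical ON/OFF boundary of 0, both genes lie on the same side of the boundary in every condition, and the score becomes positive (10/14, about 0.71).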
We believe that in biological systems it is not appropriate to average the observations after two genes (or nodes) have been clustered to create a new expression profile for the node. The motivation for this comes from the fact that in biological systems a group of genes acts together as a system to control or affect a third group.
Let us consider an ON-OFF model of the genes. According to this model, a gene can be in one of two states: turned “ON” (active) or turned “OFF” (not active).
Let us consider two systems (clusters) Ci and Cj. The various possibilities are enumerated below:
Ci     Cj     Ci,Cj
OFF    OFF    OFF
OFF    ON     ON
ON     OFF    ON
ON     ON     ON
Here Ci,Cj represents the system formed by taking Ci and Cj together. If we represent OFF by binary zero and ON by binary one, we have
Ci,Cj = (-Ci * Cj) + Ci
To apply this formula to non-discrete data I plan to use the fuzzy definitions of AND, OR and NOT stated below:
A * B = Minimum(A, B)
A + B = Maximum(A, B)
-A = 1 - A
Here we assume that the values lie in the range [0,1].
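The joining rule can be sketched as follows. This is a minimal illustration, assuming the expression values have already been rescaled to [0,1] (how to rescale is not specified here); the function names fuzzy_not, fuzzy_and, fuzzy_or, join and join_profiles are hypothetical.

```python
def fuzzy_not(a):
    """-A = 1 - A, for A in [0, 1]."""
    return 1.0 - a


def fuzzy_and(a, b):
    """A * B = Minimum(A, B)."""
    return min(a, b)


def fuzzy_or(a, b):
    """A + B = Maximum(A, B)."""
    return max(a, b)


def join(ci, cj):
    """Combined state Ci,Cj = (-Ci * Cj) + Ci for one condition."""
    return fuzzy_or(fuzzy_and(fuzzy_not(ci), cj), ci)


def join_profiles(pi, pj):
    """Apply the joining rule condition by condition to two profiles."""
    return [join(a, b) for a, b in zip(pi, pj)]


# On binary inputs the rule reproduces the ON-OFF table (an OR).
for ci in (0.0, 1.0):
    for cj in (0.0, 1.0):
        print(ci, cj, join(ci, cj))

# On fractional inputs the rule is not simply the maximum of the two.
print(join(0.5, 1.0))
```

On binary values the rule agrees with the ON-OFF table. On fractional values it need not equal Maximum(Ci, Cj): for Ci = 0.5 and Cj = 1.0 it yields 0.5. The rule is also not symmetric in its arguments, so the order in which the two clusters are passed matters for non-binary data.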
For now, I propose to apply the above methodology to the following data sets:
1) Toy data
2) Non-biological data from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)
Automobile Performance Data Set
3) Biological data
Diauxic shift (from DeRisi et al. [2]), http://cmgm.stanford.edu/pbrown/explore/array.txt
Cell cycle (Stanford), http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt
[1] Sokal, R. R. & Michener, C. D. (1958) Univ. Kans. Sci. Bull. 38, 1409-1438
[2] DeRisi, Iyer, Brown. Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. http://www.sciencemag.org/cgi/content/abstract/278/5338/680?ijkey=Yy3W4BUHKd9rA
[3] Eisen, Spellman, Brown, Botstein. Cluster analysis and display of genome-wide expression patterns. http://www.pnas.org/cgi/content/full/95/25/14863#B5