Extracting information from large data sets is a well-studied research
problem. As ever larger data sets become available (e.g., from the Human
Genome Project, gene expression experiments, and customer behavior data from
organizations such as Wal-Mart), it is becoming essential to find better ways
to extract relations (inferences) from them. We plan to use clustering and
mutual information based on entropy [1] to generate association
rules from biological as well as non-biological data sets.
Microarray-based genomic surveys and other high-throughput approaches (ranging from genomics to combinatorial chemistry) are becoming increasingly important in biology and chemistry. As a result, we need to develop our ability to "see" the information in the massive tables of quantitative measurements that these approaches produce. Clustering [3,4] is a long-studied technique for extracting this information from biological and other data sets. It places records that are "similar" in the same group, which suits expression data because co-expressed genes have similar patterns of expression. It suffers from two major defects, however. First, it does not tell you exactly how two genes or clusters are related. Second, it gives a global picture, so a relation that holds only at a local level can be lost.
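The first defect can be illustrated with a small sketch. It assumes Pearson correlation as the similarity measure between expression profiles; the gene names and values are made up for illustration and do not come from any of the data sets used here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression profiles over four conditions (made-up numbers).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls whenever geneA rises
}

sim_ab = pearson(profiles["geneA"], profiles["geneB"])
sim_ac = pearson(profiles["geneA"], profiles["geneC"])
print("similarity(geneA, geneB) =", sim_ab)
print("similarity(geneA, geneC) =", sim_ac)
```

A clustering method that groups by high similarity would place geneA and geneB together and keep geneC apart, even though geneC is just as strongly (but inversely) tied to geneA. The cluster assignment alone does not expose that relation.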
We propose to use mutual information based on entropy for generating association rules [1]. Apart from the usual positive correlations between genes, this criterion also discovers association rules with negative correlations in the data sets. We expect to find results of the form Gene1 /\ Gene2 -> ^Gene3 (where /\ denotes "and" and ^ denotes negation), which can be interpreted as follows: Gene1 and Gene2 are co-expressed and have a silencing effect on Gene3. We will compare the results from our experiments to those obtained from clustering. Since the approach is a general one, we will also apply it to some non-biological databases to show that it can extract useful information from these data sets as well.
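As a rough illustration of the criterion, the sketch below computes the mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) between binary attributes from their empirical entropies. It is not the program from [1]; the function names, the `direction` helper, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def direction(x, y):
    """+1 if the 1s co-occur more often than chance, -1 if less often, else 0."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    diff = p_xy - p_x * p_y
    return (diff > 0) - (diff < 0)

# Hypothetical binary expression data over eight experiments (made up).
gene1 = [1, 1, 1, 0, 0, 1, 0, 0]
gene2 = [1, 1, 0, 1, 0, 1, 0, 0]
gene3 = [0, 0, 1, 1, 1, 0, 1, 1]

# Left-hand side of a candidate rule Gene1 /\ Gene2 -> ^Gene3.
both = [a & b for a, b in zip(gene1, gene2)]

mi = mutual_information(both, gene3)
sign = direction(both, gene3)
print("I(Gene1 /\\ Gene2 ; Gene3) =", round(mi, 3), "direction =", sign)

# Two statistically independent attributes carry no mutual information.
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Mutual information is high whenever one attribute predicts the other, regardless of direction, which is why it can surface negative associations. The sign check is one simple way to tell a silencing relation (negative) from co-expression (positive).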
We used six data sets.
1) This data set was obtained from experiments conducted by DeRisi et al. at Stanford [4].
2) This data set was obtained from experiments conducted by the Yeast Cell Cycle Project at Stanford [5].
3) This data set was obtained from experiments conducted by Eisen et al. at Stanford [3].
4), 5) and 6) These three data sets were obtained from the UCI Machine Learning Repository: the breast cancer data set, the CPU performance data set, and the auto performance data set.
The program takes as input a data file in which every attribute is in binary form (0 or 1). Only certain attributes of each data set were relevant, so we used domain knowledge to prune the data set down to those attributes. The data was then discretized into binary values, with the cut-offs chosen according to the interpretation required. A finer level of discretization, or working directly with real values, would have been more appropriate, but the binary approach still yields a lot of information. Finally, the data was formatted in the way expected by the program.
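The preprocessing described above can be sketched as follows. The column names, the threshold, and the values are hypothetical, and a single global threshold is an assumption: the actual cut-offs were chosen per data set according to the interpretation required.

```python
def select_columns(header, rows, keep):
    """Keep only the attributes judged relevant (pruning by domain knowledge)."""
    idx = [header.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

def binarize(rows, threshold):
    """Map each real value to 1 if it exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]

# Hypothetical table: one row per experiment, one column per attribute.
header = ["geneA", "geneB", "geneC", "plate_id"]
rows = [
    [1.8, -0.4, 0.9, 17.0],
    [-1.2, 0.6, 0.1, 17.0],
    [0.3, 2.1, -0.7, 18.0],
]

relevant = select_columns(header, rows, ["geneA", "geneB", "geneC"])
binary = binarize(relevant, threshold=0.0)  # assumed cut-off

# One line of 0/1 values per record; the exact layout expected by the
# program is not specified here, so this output format is an assumption.
for record in binary:
    print(" ".join(str(bit) for bit in record))
```

With a threshold of 0, a 1 can be read as "above the reference level" and a 0 as "at or below it". Choosing a different cut-off changes what the resulting rules mean.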
The program, which uses mutual information based on entropy to extract related attributes from data, was run on an 800 MHz Pentium running Linux 6.2 with 256 MB of main memory. On a machine with more memory, the program could handle data sets with more attributes and generate association rules between them.
2. Cell cycle (Stanford), data
3. Clustering combined yeast experiments (Eisen, Stanford), data
5. Automobile Performance Data Set
The results have been summarized in the following files:
1. Biological Data
2. Non Biological Data Sets (pdf)
So far we have studied only the problem in which the attributes take binary
values. It would be more useful to study the same problem with multi-valued and
real-valued attributes. The software [1] needs to be extended so
that it can handle real-valued attributes and work with the large number
of attributes that large data sets often have.
It would also be helpful to explore different classes of correlation metrics,
with corresponding algorithms for building association rules, and to compare
the results they produce.
[1] Tiyagura: "Mining Association Rules Based on Mutual Information", M.S. thesis, Iowa State University, 1999
[2] Agrawal, R., Imielinski, T., Swami, A.: "Mining association rules between sets of items in large databases", Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993
[3] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA 95: 14863-14868, 1998
[4] DeRisi, J.L., Iyer, V.R., Brown, P.O.: "Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale", Science 278(5338): 680-686, October 24, 1997
[5] Spellman et al.: "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Mol. Biol. Cell 9(12): 3273-3297, December 1998