Mining Association Rules from Data Sets

(Neeraj Koul, Alexei Zapari and Smruti Behera)

Abstract
Introduction
Methods
Data Sets
Results
Future Work
References
Links

Source Files

Final Paper(pdf)

Appendix

 


 

Abstract

Extracting information from large datasets is a well-studied research problem. As larger and larger data sets become available (e.g., from the Human Genome Project, gene expression data, and customer behavior data from organizations such as Wal-Mart), it is becoming essential to find better ways to extract relations (inferences) from them. We plan to use clustering and mutual information based on entropy [1] to generate association rules from biological as well as non-biological datasets.



 

Introduction

Microarray-based genomic surveys and other high-throughput approaches (ranging from genomics to combinatorial chemistry) are becoming increasingly important in biology and chemistry. As a result, we need to develop our ability to "see" the information in the massive tables of quantitative measurements that these approaches produce. Clustering [3,4] is a long-studied technique for extracting this information from biological and other data sets. It groups records that are “similar” into the same cluster, which is useful for expression data because co-expressed genes have similar patterns of expression. Clustering suffers from two major drawbacks. First, it does not tell you exactly how two genes or clusters are related. Second, it gives a global picture, so any relation at a local level can be lost.

We propose to use mutual information based on entropy for generating association rules [1]. Apart from the usual positive correlations between genes, this criterion also discovers association rules with negative correlations in the data sets. We expect to find results of the form Gene1 /\ Gene2 -> ^Gene3 (where ^ denotes negation), which can be interpreted as follows: Gene1 and Gene2 are co-expressed and have a silencing effect on Gene3. We will compare the results from our experiments to those obtained from clustering. Since the approach is a general one, we will also apply it to some non-biological databases to show that it can extract useful information from these datasets as well.
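To make the criterion concrete, here is a minimal sketch of mutual information between binary attributes, computed as I(X;Y) = H(X) + H(Y) - H(X,Y). It is an illustration only and is not the software from [1]: the function names (entropy, mutual_information, conjunction, association_sign) and the eight-sample gene vectors are made up for this example, and the way a rule is read off from the score is an assumption.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def conjunction(xs, ys):
    """Element-wise AND of two binary attributes (the left side of a rule)."""
    return [x & y for x, y in zip(xs, ys)]


def association_sign(xs, ys):
    """+1 if y is more often 1 when x is 1, -1 if less often, 0 otherwise."""
    n_x = sum(xs)
    if n_x == 0:
        return 0
    p_y = sum(ys) / len(ys)
    p_y_given_x = sum(y for x, y in zip(xs, ys) if x == 1) / n_x
    return (p_y_given_x > p_y) - (p_y_given_x < p_y)


# Hypothetical binary expression profiles over eight experiments.
gene1 = [1, 1, 0, 0, 1, 0, 1, 0]
gene2 = [1, 1, 1, 0, 1, 0, 1, 1]
gene3 = [0, 0, 1, 1, 0, 1, 0, 1]  # on exactly when gene1 AND gene2 is off
gene4 = [1, 0, 1, 0, 0, 1, 1, 0]  # statistically unrelated to gene1

both = conjunction(gene1, gene2)
print(mutual_information(both, gene3), association_sign(both, gene3))
print(mutual_information(gene1, gene4), association_sign(gene1, gene4))
```

In this toy data the pair (Gene1 /\ Gene2, Gene3) gets a high mutual information score even though the two are negatively related, which is why the criterion can surface rules such as Gene1 /\ Gene2 -> ^Gene3. Mutual information itself carries no sign, so the sketch uses a separate conditional-frequency check to decide whether the right-hand side should be negated.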



 

Methods

Collecting data

    We used six data sets.

    1) This data set was obtained from experiments conducted by DeRisi et al. at Stanford [4].

    2) This data set was obtained from experiments conducted by the Yeast Cell Cycle Project at Stanford [5].

    3) This data set was obtained from experiments conducted by Eisen et al. at Stanford [3].

    4), 5) and 6) These three data sets were obtained from the UCI Machine Learning Repository. They are the breast cancer data set, the CPU performance data set and the auto performance data set.

Discretizing and Formatting the data

    The program takes as input a data file in which every attribute is in binary form (0 or 1). Only certain attributes of each data set were relevant, so we used domain knowledge to prune the data down to those attributes. The remaining data was then discretized into binary values, chosen in accordance with the interpretation required. A finer level of discretization, or working with real values directly, would have been more appropriate, but this approach still yields a lot of information. Finally, the data was formatted in the way expected by the program.
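The pruning and discretization steps can be sketched as follows. This is a sketch under stated assumptions, not the preprocessing we actually ran: it assumes a simple per-attribute threshold (value above the threshold becomes 1, otherwise 0), and the helper names select_columns and binarize, the thresholds and the sample rows are all hypothetical.

```python
def select_columns(rows, keep):
    """Keep only the attributes (column indices) judged relevant."""
    return [[row[i] for i in keep] for row in rows]


def binarize(rows, thresholds):
    """Map each value to 1 if it exceeds its column's threshold, else 0.

    Assumption: one fixed threshold per attribute; the real cut-offs
    depend on the interpretation wanted for each attribute.
    """
    return [
        [1 if value > t else 0 for value, t in zip(row, thresholds)]
        for row in rows
    ]


# Made-up measurements: three attributes per record, the third is irrelevant.
raw = [
    [0.2, 3.1, 5.0],
    [1.7, 0.4, 2.0],
    [2.5, 2.2, 9.9],
]

pruned = select_columns(raw, keep=[0, 1])
binary = binarize(pruned, thresholds=[1.0, 1.0])

for record in binary:
    print(" ".join(str(bit) for bit in record))
```

A single threshold discards the magnitude of each measurement, which is the loss of detail noted above; using several levels per attribute would keep more of it at the cost of a larger search space.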

Running

    The program, which uses mutual information based on entropy to extract related attributes from data, was run on an 800 MHz Pentium running Linux 6.2 with 256 MB of main memory. On a machine with more memory, it could handle data sets with more attributes and generate association rules between them.


 


 

 

Data Sets

1. Diauxic shift (from DeRisi et al.), data

2. Cell cycle (Stanford), data

3. Clustering combined yeast experiments (Eisen, Stanford), data

4. Breast Cancer Data Set

5. CPU Performance Data Set

6. Automobile Performance Data Set

 



 

Results

    The results have been summarized in the following files:

1. Biological Data

2. Non-Biological Data Sets (pdf)

 



 

Future Work

So far we have studied only the case in which attributes take binary values. It would be more useful to study the same problem with multi-valued and real-valued attributes. The software [1] needs to be extended so that it can handle real-valued attributes and work with a large number of attributes, as is often required for large datasets.

It would also be helpful to explore different classes of correlation metrics, with corresponding algorithms for building association rules, and to compare the results obtained from them.



 

 

References

[1] Tiyagura: "Mining Association Rules Based on Mutual Information", M.S. thesis, Iowa State University, 1999

[2] Rakesh Agrawal, Tomasz Imielinski and Arun Swami: "Mining association rules between sets of items in large databases", in Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993

[3] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA 95: 14863-14868, 1998

[4] Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown: "Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale", Science 1997 Oct 24; 278(5338): 680-6

[5] Spellman et al.: "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Mol. Biol. Cell Online, Vol. 9, Issue 12, 3273-3297, December 1998



 

 

Links

http://industry.ebi.ac.uk/~brazma/Data-mining/microarray.html

http://industry.ebi.ac.uk/~alan/MicroArray/

http://www.microarrays.org/

http://cmgm.stanford.edu/pbrown/explore/

http://genome-www.stanford.edu/Saccharomyces/

http://www.ncgr.org/research/genex/other_tools.html

http://www.cs.washington.edu/homes/jbuhler/research/array/

http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-genexp.html

http://www.ebi.ac.uk/research/pfmp/publications/biol_chem_2000/Biol_Chem-MS-revised.html
