Introduction to the term project for CS573x: A comparative evaluation of different machine learning methods on microarray gene expression data

Song Li

1  Introduction

Microarray technology measures the expression levels of a large number of genes, and the same set of genes can be measured in different biological samples. The microarray data can be viewed as a two-dimensional table, with one dimension representing the genes on the microarray chip and the other representing the samples that have been tested.
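As a minimal sketch of this table view, the Python snippet below stores a toy expression matrix with genes as rows and samples as columns. The gene names, sample names, values, and the helpers gene_profile and sample_profile are all made up for illustration; they do not come from any real dataset or from the project itself.

```python
# Toy expression table: rows are genes, columns are samples.
# All names and values here are hypothetical, for illustration only.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["sample_1", "sample_2"]
expression = [
    [2.1, 0.4],  # gene_A in sample_1, sample_2
    [0.0, 1.7],  # gene_B
    [5.3, 5.1],  # gene_C
]


def gene_profile(gene):
    """Expression of one gene across all samples (one row of the table)."""
    return expression[genes.index(gene)]


def sample_profile(sample):
    """Expression of all genes in one sample (one column of the table)."""
    j = samples.index(sample)
    return [row[j] for row in expression]


print(gene_profile("gene_B"))      # [0.0, 1.7]
print(sample_profile("sample_1"))  # [2.1, 0.0, 5.3]
```

In sample classification, each column (a sample profile) is one instance to classify, and each gene contributes one feature.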

Classification of samples is an important goal of microarray data analysis. One notable feature of microarray data is that the number of genes is usually very large (in the thousands), while the number of samples is much smaller (fewer than 100). Because of this so-called 'large p, small n' problem ([8]), many existing machine learning methods either cannot be applied or may not perform well.
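One way to see why this setting is hard is to note that, with few samples and thousands of genes, some genes will separate the classes perfectly by chance alone. The Python sketch below illustrates this on purely synthetic Gaussian noise, not on microarray data. The function name count_chance_separators, the sample and gene counts, and the seed are assumptions chosen only for the illustration.

```python
import random


def count_chance_separators(n_samples=10, n_genes=5000, seed=0):
    """Count noise 'genes' that perfectly separate two fixed classes.

    Every gene is pure Gaussian noise, independent of the labels, so any
    perfect separation found here is a chance artifact of small n.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [0] * half + [1] * (n_samples - half)
    count = 0
    for _ in range(n_genes):
        values = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        # A single threshold separates the classes if their ranges do not overlap.
        if max(class0) < min(class1) or max(class1) < min(class0):
            count += 1
    return count


separators = count_chance_separators()
print(separators, "of 5000 noise genes separate the two classes perfectly")
```

With 5 samples per class, a single noise gene separates the classes with probability 2/252, so a few dozen such genes are expected among 5000. A method that selects genes on the training samples alone can therefore look accurate while having learned nothing.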

To deal with this problem, many approaches to microarray classification have been proposed in recent years. Most of them fall into one of the following categories:
  1. SVM (support vector machine) family methods, for example [6] and [1].
  2. Linear regression based methods, for example [2], [3] and [7]. These are summarized as linear regression models in [4].
  3. Bayesian family methods. For example, [5] introduces a model that combines clustering with Bayesian inference.
The existing evaluations of these methods have the following problems:
  1. The methods were developed on different platforms, and sometimes the implementation details are hard to obtain.
  2. The methods were applied to only a very limited set of datasets.
This project aims to use WEKA as a single common testing platform and to apply these algorithms to more datasets, in addition to the datasets used in the original papers. Through this project, I hope to make the performance comparison among these algorithms clearer.
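The shape of such a comparison can be sketched as a loop over datasets and methods that reports one accuracy per pair. The Python snippet below is only an illustration under stated assumptions: the project itself plans to use WEKA, the evaluation protocol is not specified in this proposal (leave-one-out cross-validation is assumed here), and the nearest-centroid classifier and the toy dataset are hypothetical placeholders rather than any of the cited methods or datasets.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Placeholder classifier: assign x to the class with the closest mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(predict, xs, ys):
    """Leave-one-out accuracy: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy dataset: here rows are samples and columns are genes.
datasets = {
    "toy_separable": (
        [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]],
        [0, 0, 1, 1],
    ),
}
methods = {"nearest_centroid": nearest_centroid_predict}

# Same protocol for every (dataset, method) pair, so results are comparable.
for dataset_name, (xs, ys) in datasets.items():
    for method_name, predict in methods.items():
        print(dataset_name, method_name, loocv_accuracy(predict, xs, ys))
```

Running every method through the same function on every dataset is what removes the platform differences noted in the list above; further methods or datasets would be added as extra dictionary entries.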

2  About the project

The project can be divided into two parts:

3  Schedule

References

[1]
M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97: 262-267, 2000.

[2]
T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, and M.A. Caligiuri. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 285: 531-537, 1999.

[3]
I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, and O. Kallioniemi. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344: 539-548, 2001.

[4]
X. Huang and W. Pan. Linear regression and two-class classification with gene expression data. Bioinformatics, 19: 2072-2078, 2003.

[5]
X. Ji, K. Tsui, and K. Kim. A novel means of using gene clusters in a two-step empirical Bayes method for predicting classes of samples. Bioinformatics, 21: 1055-1061, 2005.

[6]
D. Komura, H. Nakamura, S. Tsutsumi, H. Aburatani, and S. Ihara. Multidimensional support vector machines for visualization of gene expression data. Bioinformatics, 21: 439-444, 2004.

[7]
R. Tibshirani, R. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99: 6567-6572, 2002.

[8]
M. West. Bayesian factor regression models in the 'large p, small n' paradigm, 2002. Discussion paper 02-12, Institute of Statistics and Decision Sciences, Duke University.
