Introduction of the term project for CS573x: A comparative evaluation of different machine learning methods on microarray gene expression data
Song Li
1 Introduction
Microarray technology measures the expression levels of a large number of genes at once, and the same set of genes can be measured across different biological samples. Microarray data can be viewed as a two-dimensional table, with one dimension representing the genes on the microarray chip and the other representing the samples that have been tested.
Classifying samples is an important goal of microarray data analysis. One notable feature of microarray data is that the number of genes is usually very large (in the thousands), while the number of samples is much smaller (fewer than 100). Because of this so-called 'large p, small n' problem ([8]), many existing machine learning methods either cannot be applied or may not perform well.
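As one illustration of why a standard method can fail in this setting, consider ordinary least squares. The notation below is assumed for this sketch and is not taken from [8]: X is the n x p expression matrix with samples as rows and genes as columns, y is the vector of class labels, and \beta is the coefficient vector.

```latex
X^{\top} X \, \beta = X^{\top} y, \qquad X \in \mathbb{R}^{n \times p},
\qquad
\operatorname{rank}(X^{\top} X) = \operatorname{rank}(X) \le \min(n, p) = n < p .
```

Since X^T X is a p x p matrix of rank at most n, it is singular whenever p > n. The normal equations then have no unique solution, so the coefficients cannot be determined without a further step such as gene selection or regularization.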
To deal with this problem, many approaches to microarray classification have been proposed in recent years. Most of them fall into one of the following categories:
- SVM (support vector machine) family methods, for example [6] and [1].
- Linear regression based methods, for example [2], [3] and [7]. These are summarized as linear regression models in [4].
- Bayesian-family methods. For example, [5] introduces a model that combines clustering with Bayesian inference.
The existing evaluations of these methods have the following problems:
- They were developed on different platforms, and sometimes the implementation details are hard to obtain.
- They were applied to a very limited number of datasets.
This project uses WEKA as the single testing platform and applies these algorithms to more datasets than those used in the original papers. Through this project, I hope to make the performance comparison among these algorithms clearer.
2 About the project
The project can be divided into two parts:
- Coding part.
  - Write a program to convert existing microarray datasets into ARFF files.
  - Design and implement algorithms that realize the methods in [4] and [5] in Java, within WEKA's framework.
- Analyzing part. Apply three algorithms (the two implemented in this project, plus WEKA's SVM implementation, which can be used directly) to the same microarray datasets, and write a final report comparing their performance. Currently I plan to get the microarray data from the UPITT Cancer Gene Expression Data Set Link Database (http://bioinformatics.upmc.edu/Help/UPITTGED.html).
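The conversion step in the coding part could look roughly like the sketch below. It is only an illustration under stated assumptions: the input is taken to be an in-memory gene-by-sample table, each sample becomes one ARFF instance with one numeric attribute per gene, and the class label is a nominal attribute declared last. The names ArffSketch, toArff and quote, and the toy data, are hypothetical and are not part of WEKA.

```java
// Hypothetical sketch: serialize a gene-by-sample table as ARFF text.
// Class, method and variable names are illustrative, not taken from WEKA.
final class ArffSketch {

    /**
     * expr[g][s] is the expression level of gene g in sample s;
     * labels[s] is the class of sample s; classes lists the allowed labels.
     */
    static String toArff(String relation, String[] genes, double[][] expr,
                         String[] labels, String[] classes) {
        if (expr.length != genes.length) {
            throw new IllegalArgumentException("expected one row of expr per gene");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(quote(relation)).append("\n\n");

        // One numeric attribute per gene.
        for (String gene : genes) {
            sb.append("@attribute ").append(quote(gene)).append(" numeric\n");
        }

        // The class label is a nominal attribute, declared last.
        sb.append("@attribute class {");
        for (int c = 0; c < classes.length; c++) {
            if (c > 0) {
                sb.append(',');
            }
            sb.append(quote(classes[c]));
        }
        sb.append("}\n\n@data\n");

        // Transpose the table: each sample becomes one data row.
        for (int s = 0; s < labels.length; s++) {
            for (int g = 0; g < genes.length; g++) {
                double v = expr[g][s];
                // ARFF marks a missing value with '?'.
                sb.append(Double.isNaN(v) ? "?" : Double.toString(v)).append(',');
            }
            sb.append(quote(labels[s])).append('\n');
        }
        return sb.toString();
    }

    /** Single-quote names containing anything beyond letters, digits, '_', '.', '-'. */
    static String quote(String name) {
        if (name.matches("[A-Za-z0-9_.-]+")) {
            return name;
        }
        return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    public static void main(String[] args) {
        // Toy data: two genes measured in two samples, one value missing.
        String[] genes = {"geneA", "gene B"};
        double[][] expr = {{1.5, 2.0}, {Double.NaN, -0.25}};
        String[] labels = {"tumor", "normal"};
        System.out.print(toArff("toy", genes, expr, labels, labels));
    }
}
```

Writing samples as rows matches the way the classifiers treat the data: each sample is an instance to be classified, and each gene is a feature. A real converter would also need to parse the source file format of each dataset, which is not shown here.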
3 Schedule
- Up to April 10: Literature search and data collection.
- April 10-17: Revisit the initial plan; finalize the algorithms to implement and the datasets for evaluation.
- April 17-24: Coding.
- April 24 - May 1: Finalize the code and start writing the final report.
- May 1 - May 4: Finish the final report and set up the project website.
References
- [1] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97: 262-267, 2000.
- [2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, and M. A. Caligiuri. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 285: 531-537, 1999.
- [3] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, and O. Kallioniemi. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344: 539-548, 2001.
- [4] X. Huang and W. Pan. Linear regression and two-class classification with gene expression data. Bioinformatics, 19: 2072-2078, 2003.
- [5] X. Ji, K. Tsui, and K. Kim. A novel means of using gene clusters in a two-step empirical Bayes method for predicting classes of samples. Bioinformatics, 21: 1055-1061, 2005.
- [6] D. Komura, H. Nakamura, S. Tsutsumi, H. Aburatani, and S. Ihara. Multidimensional support vector machines for visualization of gene expression data. Bioinformatics, 21: 439-444, 2004.
- [7] R. Tibshirani, R. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA, 99: 6567-6572, 2002.
- [8] M. West. Bayesian factor regression models in the 'large p, small n' paradigm. Discussion paper 02-12, Institute of Statistics and Decision Sciences, Duke University, 2002.