ComS 573:
Machine Learning
Department of Computer Science
Iowa State University
Spring 2008
Laboratory Assignment 1
Due: Friday, February 15, 12:00 noon.
In this assignment, you will experiment with the Naive Bayes classifier.
- Download and install Weka.
Data
- Breast Cancer data file (already in the Weka format). It has 9 numeric
attributes and 2 classes (types of cancer) to be predicted. Refer to
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/
for the complete description of the data.
- House-votes dataset. The task is to predict whether the voter is a
Republican or a Democrat based on their votes; the data description file can
be found here. It has 16 binary attributes and 2 classes. One of your tasks is
to convert this data into a Weka-readable format.
- Reuters Data: one of the benchmark datasets for text categorization and
natural language processing. It is a collection of articles from the top 10
categories. The task is to assign a topic to a new article. This dataset has
been preprocessed to reduce the vocabulary size to 300 words using mutual
information. The original data can be found here.
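One way to do the House-votes conversion is with a short script. The sketch below is in Python and rests on an assumption you should check against the data description file: that each row of the raw file lists the class label first, followed by 16 comma-separated votes coded y, n, or ?. The function name csv_to_arff and the attribute names vote1 to vote16 are placeholders chosen for illustration, not names taken from the dataset.

```python
def csv_to_arff(lines, relation="house-votes", n_attributes=16):
    """Convert comma-separated rows into ARFF text.

    Assumed row layout (verify against the data description file):
    class label first, then n_attributes votes coded y, n, or ?.
    """
    out = ["@relation " + relation, ""]
    # Placeholder attribute names; replace them with the real issue names.
    for i in range(1, n_attributes + 1):
        out.append(f"@attribute vote{i} {{y,n}}")
    out.append("@attribute class {democrat,republican}")
    out.append("")
    out.append("@data")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != n_attributes + 1:
            raise ValueError("unexpected number of fields: " + line)
        label, votes = fields[0], fields[1:]
        # Put the class last so it is the final attribute in the file.
        out.append(",".join(votes + [label]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = ["republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y"]
    print(csv_to_arff(example))
```

ARFF also uses "?" for missing values, so those entries pass through unchanged and can be handled later in Weka.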
Tasks
- Estimate the accuracy of the Naive Bayes classifier on the Breast Cancer
dataset using 5-fold cross-validation. Report a 95% confidence interval. The
dataset has numeric values; you can use Weka's filter to discretize the data
into ten bins. The dataset also has missing values (denoted by "?"). You may
choose whether or not to use Weka's missing values filter to fill them in. How
does filling in missing values affect the performance of the classifier?
- Estimate the accuracy of the Naive Bayes classifier on the House-votes
dataset using 5-fold cross-validation. Report a 95% confidence interval. The
dataset has missing values (denoted by "?"). You may choose whether or not to
use Weka's missing values filter to fill them in. How does filling in missing
values affect the performance of the classifier?
- Estimate the precision, recall, accuracy, and F-measure on the text
classification task for each of the 10 categories using 10-fold
cross-validation. Specify which model you used for the text classification.
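The quantities requested in the tasks above can be computed from the counts that Weka reports. The Python sketch below shows one common choice for the 95% confidence interval, the normal approximation to the binomial (accuracy plus or minus 1.96 standard errors), together with precision, recall, and F-measure for a single category. Whether this is the interval the course expects is an assumption; another option is an interval over the per-fold accuracies. The function names are made up for illustration.

```python
import math


def accuracy_confidence_interval(correct, total, z=1.96):
    """Normal-approximation interval for accuracy (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clip to [0, 1], since accuracy cannot fall outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def precision_recall_f1(tp, fp, fn):
    """Per-category metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, not results from any of the assignment datasets.
    low, high = accuracy_confidence_interval(correct=90, total=100)
    print(f"accuracy 0.90, 95% CI [{low:.4f}, {high:.4f}]")
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

With cross-validation, correct and total are the pooled counts over all folds, and the per-category counts come from the row and column of that category in the confusion matrix.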
What to Turn In
Turn in:
- A report (in electronic form) of the results obtained, with answers to the
questions in the Tasks section. Specify the parameters of every experiment so
that the T.A. can replicate them (e.g., indicate whether you used Weka's filter
to discretize the data), and justify your decisions about whether or not to
use these parameters.
- Any source code that you may have written.
- The House-votes and Reuters data files in arff format.
Submission instructions
You should submit your answers via email.
Compress the source code of your program (if you wrote any), the House-votes
and Reuters data files (arff files), and the report of the results,
and email the compressed (zip, rar, tar) file to rjordan at iastate.edu
(replace "at" with @) before the deadline.
You will receive a confirmation email before 12:10 p.m.