ComS 573:
Machine Learning
Department of Computer Science
Iowa State University
Spring 2008
Laboratory Assignment 2
Due: Friday, February 29, 12:00 p.m. (noon).
In this assignment, you will experiment with the Decision Tree classifier.
- Download and install Weka.
Data
- Breast Cancer data file
(already in the Weka format). It has 9 numeric attributes and 2 types of
cancer to be predicted (refer to
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/
for the complete description of the data).
- House-votes dataset (the task is to
predict whether a voter is a Republican or a Democrat based on their votes;
the data description file can be found here).
It has 16 binary attributes and 2 classes. One of your tasks is to convert
this data into a Weka-readable format.
- Reuters Data - one of the benchmark datasets
for text categorization and natural language
processing. It is a collection of articles
drawn from the top 10 categories.
The task is to assign a topic to a new article.
This dataset has been preprocessed to reduce
the vocabulary size to 300 words using mutual
information. The original data can be found
here.
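The House-votes conversion described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a required approach: it assumes each raw record is a comma-separated line with the class label first, followed by 16 values drawn from y, n, or ? (check this against the data description file). The function name votes_to_arff and the attribute names vote_1 ... vote_16 are placeholders; substitute the real attribute names from the description file.

```python
def votes_to_arff(rows, relation="house-votes", n_votes=16):
    """Convert comma-separated vote records into ARFF text.

    Assumption: each record is "<class>,<v1>,...,<v16>" where the class is
    democrat or republican and each vote is y, n, or ? (missing).
    """
    header = [f"@relation {relation}", ""]
    # Placeholder attribute names; replace with those in the description file.
    for i in range(1, n_votes + 1):
        header.append(f"@attribute vote_{i} {{n,y}}")
    header.append("@attribute class {democrat,republican}")
    header += ["", "@data"]

    data = []
    for line_no, row in enumerate(rows, start=1):
        row = row.strip()
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != n_votes + 1:
            raise ValueError(
                f"record {line_no}: expected {n_votes + 1} fields, "
                f"got {len(fields)}"
            )
        label, votes = fields[0], fields[1:]
        # Weka treats the last attribute as the class by default,
        # so the label is moved to the end of each data row.
        data.append(",".join(votes + [label]))
    return "\n".join(header + data) + "\n"


if __name__ == "__main__":
    sample = ["republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y"]
    print(votes_to_arff(sample))
```

ARFF already uses ? to mark a missing value, so the missing votes can be passed through unchanged.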
Tasks
- Estimate the accuracy of the Decision Tree classifier on the Breast Cancer dataset
using 5-fold cross-validation. Report a 95% confidence interval. The dataset has
numeric values and missing values (denoted by "?").
- Estimate the accuracy of the Decision Tree classifier on the
House-votes dataset using 5-fold cross-validation. Report a 95% confidence
interval. The dataset has missing values.
- Estimate the precision, recall, accuracy, and F-measure of the decision
tree classifier on the text classification task for each of the 10 categories
using 10-fold cross-validation.
- In each of the above experiments, compare the results obtained with the
pruning option enabled against those obtained without pruning.
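The quantities requested in the tasks above can be computed from the counts that a cross-validation run reports. The Python sketch below is one possible way to do so, not a prescribed method. It assumes the 95% interval is obtained with the normal approximation to the binomial (z = 1.96) over the pooled test instances of all folds; the assignment does not say which interval to use, so state your choice in the report. The function names are illustrative.

```python
import math


def accuracy_confidence_interval(correct, total, z=1.96):
    """Return (low, high) for the accuracy using the normal approximation.

    Assumption: 'total' is the number of test instances pooled over all
    folds and 'correct' is how many of them were classified correctly.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    accuracy = correct / total
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / total)
    # Clip to [0, 1] because an accuracy cannot fall outside that range.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


def precision_recall_f1(tp, fp, fn):
    """Per-category precision, recall, and F-measure from raw counts.

    tp: articles correctly assigned to the category
    fp: articles wrongly assigned to the category
    fn: articles of the category that were assigned elsewhere
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    low, high = accuracy_confidence_interval(correct=90, total=100)
    print(f"accuracy 95% CI: [{low:.4f}, {high:.4f}]")
    print(precision_recall_f1(tp=8, fp=2, fn=4))
```

For the text classification task, precision_recall_f1 would be called once per category, treating that category as the positive class and the other nine as negative.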
What to Turn In
Turn in:
- A report (in electronic form) of the results obtained, with answers to the questions in the
Tasks section.
Specify the parameters of every experiment in enough detail that the T.A. can
replicate it (e.g., indicate whether you used Weka's filter to discretize
the data), and justify your decision to use (or not use) each of these
parameters.
- Any source code that you may have written.
- The House-votes data file in ARFF format and the Reuters data file in ARFF
format.
Submission instructions
You should submit your answers via email.
Compress the source code of your program (if you wrote any), the House-votes
and Reuters data files (ARFF files), and the report of the results,
and email the compressed (zip, rar, tar) file to rjordan at iastate.edu (replace "at" with @) before
the deadline.
You will receive a confirmation email before 12:10 p.m.