ComS 573: Machine Learning
Department of Computer Science
Iowa State University
Spring 2008


Laboratory Assignment 2

Due: Friday, February 29, 12:00 noon.

In this assignment, you will experiment with the Decision Tree classifier.

  1. Download and install Weka.

Data

  1. Breast Cancer data file (already in Weka format). It has 9 numeric attributes and 2 classes (types of cancer) to be predicted. Refer to ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/ for the complete description of the data.
  2. House-votes dataset (the task is to predict whether the voter is a republican or a democrat based on their votes; the data description file can be found here). It has 16 binary attributes and 2 classes. One of your tasks is to convert this data into a Weka-readable format.
  3. Reuters data: one of the benchmark datasets for text categorization and natural language processing. It is a collection of articles drawn from the top 10 categories. The task is to assign a topic to a new article. This dataset has been preprocessed to reduce the vocabulary size to 300 words using mutual information. The original data can be found here.
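The conversion asked for in item 2 can be sketched as follows. This is a minimal sketch, not a required solution. It assumes the raw file follows the UCI layout: one record per line, class label first, then 16 votes encoded as y, n, or ? (missing). The function name votes_to_arff and the attribute names vote_1 to vote_16 are placeholders; the real attribute names are listed in the data description file. The class is written last because Weka treats the last attribute as the class by default.

```python
def votes_to_arff(lines, relation="house-votes"):
    """Convert comma-separated vote records into ARFF text.

    Assumed input layout (UCI style): class,vote1,...,vote16
    with votes encoded as y, n, or ? (missing).
    """
    n_votes = 16

    # Header: one nominal attribute per vote, then the class attribute.
    header = [f"@relation {relation}", ""]
    for i in range(1, n_votes + 1):
        # Placeholder names; replace with the names from the description file.
        header.append(f"@attribute vote_{i} {{y,n}}")
    header.append("@attribute class {democrat,republican}")
    header += ["", "@data"]

    rows = []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) != n_votes + 1:
            continue  # skip blank or malformed lines
        label, votes = fields[0], fields[1:]
        # ARFF also uses "?" for missing values, so votes pass through unchanged.
        rows.append(",".join(votes + [label]))

    return "\n".join(header + rows) + "\n"
```

To use it, pass an open file (or any iterable of lines) and write the returned string to a file with the .arff extension.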

Tasks

  1. Estimate the accuracy of the Decision Tree classifier on the Breast Cancer dataset using 5-fold cross-validation. Report the 95% confidence interval. The dataset has numeric values and missing values (denoted by "?").
  2. Estimate the accuracy of the Decision Tree classifier on the House-votes dataset using 5-fold cross-validation. Report the 95% confidence interval. The dataset has missing values.
  3. Estimate the precision, recall, accuracy, and F-measure of the Decision Tree classifier on the text classification task for each of the 10 categories using 10-fold cross-validation.
  4. For each of the above experiments, compare the results obtained with pruning enabled against those obtained without pruning.
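The quantities requested in tasks 1 to 3 can be computed from the counts Weka reports. The sketch below makes two assumptions. First, it uses the normal approximation to the binomial for the 95% confidence interval (accuracy plus or minus 1.96 standard errors); other constructions exist, such as a t-interval over the per-fold accuracies. Second, it treats each Reuters category as a one-vs-rest problem with true/false positive and negative counts. The function names are illustrative only.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Return (accuracy, lower, upper) using the normal approximation.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    acc = correct / total
    half_width = z * math.sqrt(acc * (1.0 - acc) / total)
    # Clip to [0, 1] since accuracy cannot leave that range.
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


def per_class_scores(tp, fp, fn, tn):
    """Return (precision, recall, accuracy, f_measure) for one category.

    Counts are one-vs-rest: tp/fp/fn/tn for "this category" vs "all others".
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy, f_measure
```

As an illustration with made-up counts, 90 correct predictions out of 100 instances give an accuracy of 0.90 with an interval of roughly 0.84 to 0.96.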

 

What to Turn In

Turn in:

Submission instructions

You should submit your answers via email.
Compress the source code of your program (if you wrote any), the House-votes and Reuters data files (ARFF files), and the report of your results, and email the compressed file (zip, rar, or tar) to rjordan at iastate.edu (replace "at" with @) before the deadline. You will receive a confirmation email before 12:10 p.m. (noon).