ComS 573: Machine Learning
Department of Computer Science
Iowa State University
Spring 2008


Laboratory Assignment 1

Due: Friday, February 15, 12:00 noon.

In this assignment, you will experiment with the Naive Bayes classifier.

  1. Download and install Weka.

Data

  1. Breast Cancer data file (already in Weka format). It has 9 numeric attributes and 2 classes (types of cancer) to predict. Refer to ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/ for the complete description of the data.
  2. House-votes dataset (the task is to predict whether the voter is a Republican or a Democrat based on their votes; the data description file can be found here). It has 16 binary attributes and 2 classes. One of your tasks is to convert this data into a Weka-readable format.
  3. Reuters Data: one of the benchmark datasets for text categorization and natural language processing. It is a collection of articles from the top 10 categories. The task is to assign a topic to a new article. This dataset has been preprocessed to reduce the vocabulary size to 300 words using mutual information. The original data can be found here.
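
The conversion of the House-votes data into a Weka-readable file could be sketched as follows. This is only a minimal illustration, not a required method. It assumes each input line holds the class label first, followed by 16 comma-separated votes coded as y, n, or ? (check the data description file to confirm). The function name votes_to_arff and the attribute names vote1 to vote16 are hypothetical placeholders; the class is written last because Weka treats the last attribute as the class by default.

```python
def votes_to_arff(lines, relation="house-votes", n_votes=16):
    """Convert comma-separated vote records into ARFF text.

    Assumed input layout (verify against the data description):
    class label first, then n_votes values from {y, n, ?}.
    """
    header = [f"@relation {relation}", ""]
    # Placeholder attribute names; replace with the real vote names if desired.
    for i in range(1, n_votes + 1):
        header.append(f"@attribute vote{i} {{n,y}}")
    header.append("@attribute class {democrat,republican}")
    header += ["", "@data"]

    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != n_votes + 1:
            raise ValueError(f"expected {n_votes + 1} fields, got {len(fields)}")
        label, votes = fields[0], fields[1:]
        # "?" is already the ARFF missing-value marker, so it is kept as-is.
        rows.append(",".join(votes + [label]))
    return "\n".join(header + rows) + "\n"
```

To use it, read the raw data file into a list of lines, pass the list to votes_to_arff, and write the returned string to a file with the .arff extension.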

Tasks

  1. Estimate the accuracy of the Naive Bayes classifier on the Breast Cancer dataset using 5-fold cross-validation. Report a 95% confidence interval. The dataset has numeric values; you can use Weka's filter to discretize the data into ten bins. The dataset also has missing values (denoted by "?"). You may fill them in with Weka's missing-values filter or leave them as they are. How does filling in the missing values affect the performance of the classifier?
  2. Estimate the accuracy of the Naive Bayes classifier on the House-votes dataset using 5-fold cross-validation. Report a 95% confidence interval. The dataset has missing values (denoted by "?"). You may fill them in with Weka's missing-values filter or leave them as they are. How does filling in the missing values affect the performance of the classifier?
  3. Estimate the precision, recall, accuracy, and F-measure on the text classification task for each of the 10 categories using 10-fold cross-validation. Specify which model is used for the text classification.
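
The quantities asked for in the tasks above can be computed from Weka's cross-validation output roughly as sketched here. The confidence interval uses the normal approximation to the binomial, which is an assumption on my part; the course may expect a different method. The per-category measures treat each category as one-vs-rest. The function names accuracy_ci and per_class_metrics are hypothetical helpers, not part of Weka.

```python
import math


def accuracy_ci(correct, total, z=1.96):
    """Approximate confidence interval for accuracy.

    Assumes the normal approximation to the binomial:
    p +/- z * sqrt(p * (1 - p) / total), with z = 1.96 for 95%.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def per_class_metrics(y_true, y_pred, label):
    """One-vs-rest precision, recall, accuracy, and F-measure for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a label is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}
```

For cross-validation, correct and total would be the pooled counts over all folds, since each instance is tested exactly once.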

 

What to Turn In

Turn in:

Submission instructions

You should submit your answers via email.
Compress the source code of your program (if you wrote any), the House-votes and Reuters data files (ARFF files), and the report of your results, and email the compressed file (zip, rar, or tar) to rjordan at iastate.edu (replace "at" with @) before the deadline. You will receive a confirmation email before 12:10 p.m.