
Iowa State University

Machine Learning: Laboratory Assignment 1

Department of Computer Science

Laboratory Assignment 1


Due February 2, 2007

In this assignment, you will experiment with the Naive Bayes learner.

  1. Download and install Weka (for instructions on how to install it on your personal computer or in your CS space, refer to http://www.cs.iastate.edu/~cs573x/weka.html).

Data

  1. Breast Cancer data file (already in the needed format). It has 9 numeric attributes and 2 types of cancer to be predicted. (refer to ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/ for the complete description of the data).
  2. House-votes dataset (the task is to predict whether the voter is a Republican or a Democrat based on their votes; the data description file can be found here). It has 16 binary attributes and 2 classes. You will need to convert this data into Weka-readable (ARFF) format.
  3. Reuters data - one of the benchmark datasets for text categorization and natural language processing. It is a collection of articles drawn from the top 10 categories. The task is to assign a topic to a new article.

    This dataset has been preprocessed to reduce the vocabulary size to 300 words using mutual information. The original data can be found here.
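
The conversion of the house-votes data to Weka's ARFF format can be sketched as below. This is only a minimal illustration, not a required approach. It assumes each record is a comma-separated line with the class label first, followed by 16 votes taken from y, n, or ? (missing). The function name votes_to_arff and the attribute names vote1 to vote16 are placeholders invented for this sketch; take the real attribute names from the data description file. The class is moved to the last column because Weka treats the last attribute as the class by default.

```python
def votes_to_arff(lines, relation="house-votes-84", n_votes=16):
    """Convert comma-separated vote records to ARFF text.

    Assumption: each record is the class label followed by n_votes
    values, each one of y, n, or ? (ARFF also uses ? for missing).
    """
    header = [f"@relation {relation}", ""]
    # Placeholder attribute names (hypothetical); replace them with the
    # names listed in the data description file.
    header += [f"@attribute vote{i} {{y,n}}" for i in range(1, n_votes + 1)]
    header += ["@attribute class {democrat,republican}", "", "@data"]

    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != n_votes + 1:
            raise ValueError(
                f"expected {n_votes + 1} fields, got {len(fields)}: {line!r}"
            )
        label, votes = fields[0], fields[1:]
        # Put the class last, where Weka looks for it by default.
        rows.append(",".join(votes + [label]))
    return "\n".join(header + rows) + "\n"


# Two made-up records in the assumed input layout.
sample = [
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y",
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n",
]
arff_text = votes_to_arff(sample)
print(arff_text)
```

To convert a real file, pass an open file object instead of the sample list and write the returned text to a file with the .arff extension.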

Tasks

  1. Estimate the accuracy of the Naive Bayes algorithm on the house-votes-84 dataset using 5-fold cross-validation. (Note: both the house-votes and Breast Cancer datasets have missing values. You may use Weka's missing values filter to fill them in, or leave them as they are. How does filling in missing values affect the performance of the classifier?)
  2. Estimate the accuracy of the Naive Bayes classifier on the Breast Cancer dataset using 5-fold cross-validation. The Breast Cancer dataset has numeric attributes; you can use Weka's discretization filter to discretize them. (Again, you can handle missing values as in the previous task.)
  3. Estimate the precision, recall, accuracy, and F-measure on the text classification task for each of the 10 categories using 10-fold cross-validation.
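
For reference, the per-category measures asked for in the last task can be computed from the true and predicted labels as sketched below. This is a minimal illustration, not Weka's implementation. It assumes each article has exactly one true and one predicted category, and it reads per-category accuracy in the one-vs-rest sense (the fraction of articles for which the "is this category or not" decision is correct), which is an interpretation rather than something the assignment specifies. The function name per_class_metrics and the category labels in the example are made up for illustration.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return per-class precision, recall, F-measure and one-vs-rest accuracy,
    plus the overall accuracy, for single-label predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    n = len(y_true)
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {
            "precision": precision,
            "recall": recall,
            "f_measure": f_measure,
            # One-vs-rest accuracy: everything except this class's FP and FN.
            "accuracy": (n - fp[c] - fn[c]) / n,
        }
    overall_accuracy = sum(tp.values()) / n
    return metrics, overall_accuracy


# Made-up labels, only to exercise the function.
y_true = ["earn", "earn", "acq", "acq", "crude", "earn"]
y_pred = ["earn", "acq", "acq", "acq", "crude", "earn"]
metrics, overall_accuracy = per_class_metrics(y_true, y_pred)
for category, values in metrics.items():
    print(category, values)
print("overall accuracy:", overall_accuracy)
```

Under cross-validation, these counts are usually accumulated over the test portions of all folds before the ratios are taken.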

 

What to Turn In

Turn in via the turnin script (see the instructions on the laboratory assignment page):

  • A report of the results obtained with answers to the questions in the Tasks section.
  • Any source code that you may have written.