Laboratory Assignment 1
Due February 2, 2007
In this assignment, you will experiment with the Naive Bayes learner.
- Download and install Weka (for instructions on how to install it on your
personal computer or in your CS space, refer to
http://www.cs.iastate.edu/~cs573x/weka.html).
Data
- Breast Cancer data file
(already in the needed format). It has 9 numeric attributes and 2 types of
cancer to be predicted. Refer to
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/
for the complete description of the data.
- House-votes dataset (the task is to predict whether the voter is a
Republican or a Democrat based on their votes; the data description file can
be found here). It has 16 binary attributes
and 2 classes. You will need to convert this data into a Weka-readable
format (ARFF).
- Reuters Data: one of the benchmark datasets for text categorization and
natural language processing. The collection used here consists of articles
in the top 10 categories. The task is to assign a topic to a new article.
This dataset has been preprocessed to reduce the vocabulary size to 300 words
using mutual information. The original data can be found
here.
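The conversion of the house-votes data to ARFF can be done by hand or with a short script. The following is only a minimal sketch in Python, not part of Weka. It assumes each input record lists the class label first, followed by 16 comma-separated votes coded as y, n, or ? (check the data description file to confirm this). The function name votes_to_arff and the attribute names vote1 to vote16 are placeholders invented for this example; the real attribute names are listed in the data description file.

```python
def votes_to_arff(lines, relation="house-votes-84", n_attrs=16):
    """Convert vote records (class label first, then y/n/? votes) to ARFF text.

    Assumed input layout: label,vote1,...,vote16 on each line.
    """
    out = ["@relation " + relation, ""]
    # Placeholder attribute names; replace with the names from the description file.
    for i in range(1, n_attrs + 1):
        out.append(f"@attribute vote{i} {{n,y}}")
    out.append("@attribute class {democrat,republican}")
    out.append("")
    out.append("@data")
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        fields = [f.strip() for f in raw.split(",")]
        if len(fields) != n_attrs + 1:
            raise ValueError(f"expected {n_attrs + 1} fields, got {len(fields)}: {raw!r}")
        label, votes = fields[0], fields[1:]
        # Move the class label to the end; "?" already marks a missing value in ARFF.
        out.append(",".join(votes + [label]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = ["republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y"]
    print(votes_to_arff(sample))
```

The class label is moved to the last column because the Weka Explorer treats the last attribute as the class by default. To convert the real file, pass an open file object (for example one opened on your downloaded copy of the data) as the lines argument and write the returned string to a file ending in .arff.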
Tasks
- Estimate the accuracy of the Naive Bayes algorithm on the house-votes-84 data set using 5-fold cross-validation.
(Note: both the house-votes and breast cancer datasets have missing values. You may use Weka's missing values filter to fill
them in, or leave them as they are. How does filling in missing values affect the performance of the classifier?)
- Estimate the accuracy of the Naive Bayes classifier on the breast cancer data set using 5-fold cross-validation. The breast cancer dataset has numeric attributes. You can use Weka's filter to
discretize the data. (Again, you can handle missing values as in the previous task.)
- Estimate the precision, recall, accuracy, and F-measure on the text classification task for each of the 10 categories using 10-fold cross-validation.
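Weka reports these quantities itself, so no code is required for the tasks above. As a rough illustration of what the numbers mean, the Python sketch below shows one way to split n items into k folds and to compute per-category precision, recall, F-measure, and accuracy from true and predicted labels. The function names k_fold_splits and per_class_metrics are made up for this example, and the per-category accuracy uses a one-vs-rest reading, (TP + TN) / N, which is an assumption about what is intended. The folds here are contiguous and unshuffled, which is a simplification.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def per_class_metrics(y_true, y_pred):
    """Return ({label: metrics}, overall_accuracy) for parallel label lists."""
    n = len(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true label was something else
            fn[t] += 1  # true label t was missed
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        # One-vs-rest accuracy: every item that is neither a FP nor a FN for c.
        accuracy = (n - fp[c] - fn[c]) / n if n else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f_measure": f_measure, "accuracy": accuracy}
    overall = sum(tp.values()) / n if n else 0.0
    return report, overall


if __name__ == "__main__":
    # Toy labels for illustration only; not results from the Reuters data.
    truth = ["earn", "earn", "acq", "acq", "crude"]
    preds = ["earn", "acq", "acq", "acq", "earn"]
    report, overall = per_class_metrics(truth, preds)
    for label, m in report.items():
        print(label, m)
    print("overall accuracy:", overall)
```

In a cross-validation run, the predictions collected from all test folds would be pooled (or the per-fold metrics averaged) before reporting; which of the two you use should be stated in the report.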
What to Turn In
Turn in via the turnin script (see the instructions on the
laboratory assignment page):
- A report of the results obtained
with answers to the questions in the Tasks section.
- Any source code that you may have written.