Start with the code for the Naive Bayes Classifier which is available in WEKA.
Recall that the Naive Bayes Classifier assumes that the attributes are independent
given the class label. Extend the Naive Bayes implementation in ONE of two ways:
- A Bayesian Network for classification based on the algorithm described in the paper:
Friedman, N., Geiger, D., and Goldszmidt, M. Bayesian Network Classifiers. Machine Learning 29: pp. 131-163. 1997.
Your algorithm should be able to handle missing attribute values. You can use Dirichlet priors
for estimation of the relevant conditional probabilities.
OR
- A Naive Bayes Classifier that takes advantage of a user-supplied Attribute Value Taxonomy (AVT), as described in the paper:
Jun Zhang and Vasant Honavar. Learning Naive Bayes Classifiers from Attribute Value Taxonomies and Partially Specified Data. Tech Report 2004-03, Department of Computer Science, Iowa State University.
Your implementation should handle missing attribute values. It is not necessary
to handle partially specified attribute values, although you may do so if you choose.
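Both options require estimating conditional probabilities in the presence of missing values. As a rough illustration only, the following plain-Java sketch (no WEKA dependency) computes P(value | class) with a symmetric Dirichlet prior. The class name DirichletEstimator, the MISSING marker, and the alpha parameter are hypothetical names introduced for this example. The sketch simply skips missing values when counting, which is one common treatment and not necessarily the one used in either paper.

```java
/** Hypothetical helper (not part of WEKA): smoothed estimates of P(value | class). */
final class DirichletEstimator {
    /** Marker for a missing attribute value; an assumption of this sketch. */
    static final int MISSING = -1;

    /**
     * @param values     attribute value index for each instance, or MISSING
     * @param classes    class index for each instance
     * @param numValues  number of distinct values of the attribute
     * @param numClasses number of class labels
     * @param alpha      symmetric Dirichlet pseudo-count (alpha = 1 is Laplace smoothing)
     * @return table p[c][v] estimating P(attribute = v | class = c)
     */
    static double[][] estimate(int[] values, int[] classes,
                               int numValues, int numClasses, double alpha) {
        double[][] counts = new double[numClasses][numValues];
        double[] totals = new double[numClasses];
        for (int i = 0; i < values.length; i++) {
            if (values[i] == MISSING) {
                continue; // a missing value contributes to no count
            }
            counts[classes[i]][values[i]] += 1.0;
            totals[classes[i]] += 1.0;
        }
        double[][] p = new double[numClasses][numValues];
        for (int c = 0; c < numClasses; c++) {
            for (int v = 0; v < numValues; v++) {
                // Posterior mean under a symmetric Dirichlet(alpha) prior.
                p[c][v] = (counts[c][v] + alpha) / (totals[c] + alpha * numValues);
            }
        }
        return p;
    }
}
```

With alpha greater than zero, every estimate is strictly positive, so a value never observed with a given class does not force the whole product of probabilities to zero.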
Your code should be implemented using the WEKA Machine Learning workbench, taking advantage of the built-in functionality of WEKA.
A brief introduction to WEKA, including its installation and
settings, can be found here.
As in the first lab assignment (decision trees), three different datasets are
provided to you, together with the Attribute Value Taxonomies (AVTs) for two of
them. The datasets are already in WEKA ARFF format, whereas the AVT format is
explained here.
- (1) The first dataset is the "Nursery Database" from UC Irvine, which contains 12960 instances, 8 nominal attributes, and no missing attribute values.
The attribute value taxonomy for this dataset can be found here.
- (2) The second dataset is the 1984 "Congressional Voting Records Database". All attributes are Boolean valued. The dataset contains some missing values. No
attribute value taxonomy is provided for this dataset, so you have to construct it
yourself. With the guidelines above and the header of the .arff file, it should
not take long.
- (3) The third dataset is the "Mushroom Database", which contains about 10% missing attribute values for each attribute. The
attribute value taxonomy for this dataset can be found here.
We recommend adapting the EvalBayesian class as the main class for evaluating the classifiers produced by the algorithms you implement.
This will make it much more convenient for the TA to run your program and make
comparisons. However, you can define your own main class to debug a particular classifier.
All the classifier classes are passed to
the Evaluation class. The evaluation() method can use the command-line options directly to specify the training and test files.
To add specific options, you can use the OptionHandler interface in
weka.core. Please be sure to include the buildClassifier() and classifyInstance()
methods in your classifier classes.
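To show the division of labor between the two methods, here is a minimal plain-Java sketch of a Naive Bayes classifier that skips missing values. It is a stand-in, not WEKA code: the class SketchNaiveBayes, the MISSING marker, and the int-array data representation are assumptions made so the example is self-contained. In WEKA itself, buildClassifier() takes an Instances object and classifyInstance() takes an Instance and returns the predicted class index as a double.

```java
/** Hypothetical stand-in for a WEKA classifier; uses int arrays instead of Instances. */
final class SketchNaiveBayes {
    /** Marker for a missing attribute value; an assumption of this sketch. */
    static final int MISSING = -1;

    private double[] classPrior; // classPrior[c] = P(class = c)
    private double[][][] cond;   // cond[a][c][v] = P(attribute a = v | class = c)

    /** Plays the role of buildClassifier(): learn Laplace-smoothed estimates. */
    void buildClassifier(int[][] data, int[] labels, int[] numValues, int numClasses) {
        int numAttrs = numValues.length;
        classPrior = new double[numClasses];
        cond = new double[numAttrs][numClasses][];
        double[][] totals = new double[numAttrs][numClasses];
        for (int a = 0; a < numAttrs; a++) {
            for (int c = 0; c < numClasses; c++) {
                cond[a][c] = new double[numValues[a]];
            }
        }
        // Accumulate raw counts, ignoring missing attribute values.
        for (int i = 0; i < data.length; i++) {
            classPrior[labels[i]] += 1.0;
            for (int a = 0; a < numAttrs; a++) {
                int v = data[i][a];
                if (v == MISSING) {
                    continue;
                }
                cond[a][labels[i]][v] += 1.0;
                totals[a][labels[i]] += 1.0;
            }
        }
        // Convert counts to smoothed probabilities (pseudo-count of 1 per cell).
        for (int c = 0; c < numClasses; c++) {
            classPrior[c] = (classPrior[c] + 1.0) / (data.length + numClasses);
            for (int a = 0; a < numAttrs; a++) {
                for (int v = 0; v < numValues[a]; v++) {
                    cond[a][c][v] = (cond[a][c][v] + 1.0) / (totals[a][c] + numValues[a]);
                }
            }
        }
    }

    /** Plays the role of classifyInstance(): index of the most probable class. */
    int classifyInstance(int[] instance) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < classPrior.length; c++) {
            double score = Math.log(classPrior[c]); // log space avoids underflow
            for (int a = 0; a < instance.length; a++) {
                if (instance[a] != MISSING) {
                    score += Math.log(cond[a][c][instance[a]]);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

Leaving a missing attribute out of the product is equivalent to marginalizing over its values, which follows from the independence assumption of Naive Bayes.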
What to Turn In
You need to turn in a hardcopy of the following:
- Your documented source code.
- A report comparing the performance of the standard Naive
Bayes Classifier with the classifier generated by either Friedman's algorithm or
Zhang's algorithm (depending on which of the two you chose to implement)
on each dataset.
You need to turn in electronically the following:
- The documented source code. As in lab1, you are free to present Javadoc
generated documentation (only electronic submission).
- The lab report (the digital form of the report mentioned above).
- An executable version of your code. Your code should run on the CS
machines and reproduce the results reported. For that,
include a readme file that specifies precisely the class names and
calling parameters for each of the above experiments (e.g., if you
implemented Zhang's algorithm, you may add an extra parameter for loading
the AVT file; if so, explain it clearly in the readme file).
Some scattered recommendations, prompted by reviewing the lab 1
submissions:
- Do not submit the whole WEKA folder, class hierarchy or modified weka.jar.
You should find a way to avoid the need of that.
- In some cases it was considerably hard to get your code running. The
reasons were several, but the most notable are the following:
- A missing or poorly explained README file. The README is a must: if it is
missing or incomplete, I'll subtract a substantial part of the grade (roughly
20%).
- Running your code required modifying original WEKA classes like
J48.class or ModelSelection.class. Moving files to the WEKA folder was not a
problem when the names of the files/classes were unique. However, if I change a
.class for one student, it becomes impossible to test others' code. To solve
this problem, one suggestion is that you name your classes in a unique way (e.g.
YOUR_NAME_J48.java for J48.java) or leave the classes in your lab directory and
run the code with the -classpath . option.
- I am not supposed to recompile your classes. Your submission should include
compiled code for the JVM.
- Mark clearly in your code which parts are your modifications. I encountered cases
with 10 pages of code and no marking whatsoever of which parts were the
modifications!
- Although answering precisely the questions posed at the end of the lab (not
the case for lab 2) is strongly recommended, I would also encourage you to
present the report in a more professional, paper-like fashion. The rule of thumb
here is that the report should be self-contained. The main missing parts I
encountered were:
- Tables summarizing the results of different algorithms.
- Figures with the modifications made to the original code.
- An introduction of the problem/s being addressed.
- More involved discussion of the results.
This recommendation does not mean a longer report. All of these can be included
in a short report. The general layout is what I recommend you reconsider.
- I should note that many of you did very well, and therefore these
recommendations do not apply to everyone. Check with me if you want personal
comments about your lab.
If you have any questions, or would like some
clarification on the above recommendations, feel free to contact me.
This page is maintained by: bromberg@cs.iastate.edu.
Dr. Vasant Honavar
Department of Computer Science
Iowa State University
Atanasoff Hall, Ames, IA 50011-1040 USA
phone: +1-515-294-4377, fax: +1-515-294-0258