Lab 2: Implementing a Decision Tree Based Learning Agent in Java
http://www.cs.iastate.edu/~cs572/labs/lab2/lab2.html
Out: Nov 11, 2002
Due: Dec 2, 2002
ComS 572 Principles of Aritifical Intelligence
Dimitris Margaritis
Department of Computer Science
Iowa
State University
TA, Jie Bao
Dept of Computer Science
Iowa State University
baojie@cs.iastate.edu
http://www.cs.iastate.edu/~baojie
Nov
18, 2002
Linear attribute is the real number attribute For example, in dataset 2, the first lines are (I omit most of the columns):
Them are all linear attribute also called nominal attribute or continuous attribute
To convert a linear you can transverse all the data and find the maximal and minimal value of that attribute, split the interval [min, max] into 3, 5 or more small intervals (no matter evenly or not), and define those smaller intervals as discrete domain
from http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
Step 1: If all instances in C are positive, then create YES node and
halt.
If all instances in C are negative, create a NO node and halt.
Otherwise select a feature, F with values v1, ..., vn and create a decision
node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn
according to the values of V.
Step 3: apply the algorithm recursively to each of the sets Ci.
You can read the name file to a Attribute Table and the trainset and testset to Instance Tables. Since in this lab, all dataset are not too big, they can be entirely stored in the memory.
Attribute Table maintains the attribute #, string name, type(linear or not), the domain
Instance Table maintains the instance #, all the attribute value(may be converted into natural number), and goal class value.
modified from http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
When choose a attribute to split a node(a collection of instances) information gain will be used.
Gain measures how well a given attribute separates training examples into targeted classes. The one with the highest information (information being the most useful for classification) is selected.
Given a collection S of c outcomes
where p(I) is the proportion of S belonging to class I. S is over c. Log2 is log base 2.
Note that S is not an attribute but the entire instance set.
If S is a collection of 14 examples with 9 YES and 5 NO examples then
Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified). The range of entropy is 0 ("perfectly classified") to 1 ("totally random").
Gain(S, A) is information gain of example set S on attribute A is defined as
Where:
S is each value v of all possible values of
attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S
Eg: Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values of Wind can be Weak or Strong. The classification of these 14 examples are 9 YES and 5 NO. For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are YES and 2 are NO. For Wind = Strong, 3 are YES and 3 are NO. Therefore
Entropy(Sweak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) =
0.811
Entropy(Sstrong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) =
1.00
For each attribute, the gain is calculated and the highest gain is used in
the decision node.
In C5, to split a node(a collection of instances), you will compute the Gain for all attributes that haven't be used to split before, and choose the one with maximal Gain as next splitting attributes.
That also means you remember which attributes have already been used.
Note:
Actually you don't need to compute Entropy(S), just choosing attribute
with minimal S ((|Sv| /
|S|) * Entropy(Sv))
is enough. (why?)
use a linear vector to simulate the tree:
class Node {
public int[] children; //store the position
of children
public int parent; //store
the parent node
//CONSTRUTOR
public Node( int int parent) {
this.parent
= parent;
}
// to insert some new node into the tree as children of certain node
("parent")
static Vector decisionTree = new Vector(); //Storage for
decision tree
// add a new element in the tree
// suppose
"parent" is the index of current node in the tree.
Node
currentNode = (Node)decisionTree.elementAt(parent);
int childrenSize
= ...; // get number of children
currentNode.children = new int[childrenSize];
//
add all children at the end of the tree
for(int i = 0; i < childrenSize;
i++) {
currentNode.children[i]
= decisionTree.size();
Node
newChild = new Node(parent);
decisionTree.addElement(newChild);
}
Note that the Node correspoding to a collection of instances, so you may need a "bitmap" array with same size to instance set. Eg. if instance 7 is in the Node, array[7] =1 ; if instance 7 is not in Node, array[7]=0.
Q: For the design specifications on lab 2 there seem to be conflicting requirements. One line states: You need to write a program to randomly split the given data set into two subsets: one subset which is 2/3 of the original dataset to be used for training the decision tree and the other subset which corresponds to the remainder of the dataset to be used for testing the resulting decision tree.
And yet we are supposed to call the program as follows: java C5 <namesFile> <trainsetFilename> <testsetFilename>
So there seems to be a contradiction since the first statement implies that we enter the entire data set into our program and then partition it, whereas the second statement seems to imply that we enter the two sets manually.
Now I would imagine that the second line is in error and the calling function would be instead: java C5 <namesFile> <Filename>
A: 1- In lab2, you need to write a program to split the data randomly into test datase and training dataset. It's an independent program to C5.java, but is central to training effect.
You can save your testset and trainingset files and then use them in C5. It will save you lots time by avoiding splitting the dataset every time, since the dataset may be quite big(such as the 3rd one).
2- You should run the program C5 with parameters:
java C5 <namesFile> <trainsetFilename> <testsetFilename>
the <trainsetFilename> and <testsetFilename> are the output of the splitting program.
Q: So do we also need to turn in this separate data splitting program as well, or do we just use it for our own purposes to separate the data for testing?
A: Yes. It's important because the sequence in the dataset may contain relevance between adjacent data item, so a random splitting is needed. It's not a very difficult program. The core is like this:
randNum=3*java.lang.Math.random( );
if (randNum<2)
{ ...// add
curent data item to training set }
else
{ ...// add it to test set
}
...//move to next data item
Another reason for you to write a separate spliting program is you need to store your trainset and testset and test them with another Decision Tree Program . (I recommend you Weak)
|
|
C code by Dr.Dimitris Margaritis: http://www-2.cs.cmu.edu/~dmarg/Source/DT.html |
You can read them and find inspiration, but NEVER copy them! Write every line by yourself.
[Return to Jie Bao's Homepage]