Homework 4
Cs472/572 Artificial Intelligence
Dept
of Computer Science
Iowa State University
http://www.cs.iastate.edu/~cs572/assignments/assignment4.pdf
TA ,Jie Bao
Dept of Computer Science
Iowa State University
baojie@cs.iastate.edu
http://www.cs.iastate.edu/~baojie
Dec
6, 2002
1.(30 pts) Solve exercise 15.1 from the Russell and Norvig text.
Adding variables to an existing net can be done in two ways. Formally speaking, one should insert the variables into the variable ordering and rerun the network construction process from the point where the first new variable appears. Informally speaking, one never really builds a network by a strict ordering. Instead, one asks what variables are direct causes or influences on what other ones, and built local parent/child graphs that way. It is usually easy to identify wherein such a structure the new variable goes, but one must be very careful to check for possible induced dependencies downstream.
a. IcyWeather is not caused by any
of the car-related variables, so needs no parents. It directly affects the
battery and the starter motor StarterMotorWorking (SMW for short) is an additional
precondition for Starts. The new network is shown in the figure.
b. Reasonable probabilities may vary a lot depending on the kind of car and perhaps the personal experience of the assessor. The following values indicate the general order of magnitude and relative values that make sense:
c. With 8 Boolean variables, the joint has 2^8 - 1 = 255 independent entries.
d. Given the topology shown in Figure the total number of independent CPT entries is
1+2+2+2+2+1+8+2= 20.
e. The CPT for Starts describes a set of necessary conditions that are together almost sufficient That is, all the entries art zero except for the entry where all the conditions are true. That entry will be not quite I (because there is always some other possible fault that we didn’t think of, but as we add more conditions it gets closer to I.
2.(30 pts) Solve exercise 15.2 from the Russell and Norvig text.
This question has some tough parts but exercises many aspects of the student's understanding of belief networks and uncertainty. It was first given as a final exam question for UC Berkeley's undergraduate Intro to AI class, at which time only 3 of 35 students got it completely correct in the time available (20 minutes).
a. A suitable network is shown in the figure. The key aspects are: the failure nodes are parents of the sensor nodes, and the temperature node is a parent of both the gauge and the gauge failure node. It is exactly this kind of correlation that makes it difficult for humans to understand what is happening in complex systems with unreliable sensors.
b. No matter which way the student draws the network, it should not be a polytree because of the fact tbat the temperature influences the gauge in two ways.
c. The CPT is shown below. The wording of the question is a little tricky because x and y are defined in terms of "incorrect" rather than "correct"
|
|
T=Normal |
T=High |
||
|
|
FG |
~FG |
FG |
~FG |
|
G=Normal |
1-y |
1-x |
y |
x |


(a) A belief network for the nuclear alarm problem (b) Adding a second gauge
d. Suppose the alarm works unless it is faulty, in which case it never goes off. Give the conditional probability table associated with A.
|
|
G=Normal |
G=High |
||
|
|
FA |
~FA |
FA |
~FA |
|
A |
0 |
0 |
0 |
1 |
e This part actually asks the student to do something usually done by belief network algorithms. The great thing is that doing the calculation without a belief network makes it easy to see the nature of the calculations that the algorithms are systematizing. It illustrates the magnitude of the achievement involved in creating complete and corrected algorithms.
Abbreviating T = High and G = High by T and G, the probability of interest here is P(T|A, ~FG, ~FA). Because the alarm's behavior is deterministic, we can reason that if the alarm is working and sounds, G must be High. Because FA and A are d-separated from T, we need only calculate P(T | ~FG, G).
There are several ways to go about doing this. The "opportunistic way is to notice that the CPT entries give us P(G | T, ~FG). which suggests using the generalized Bayes' Rule to switch G and T with ~FG as background:
P(T | ~FG,G) ∝ P(G | T,~FG) P(T |~FG)
We then use Bayes' Rule again on the last term:
P(T|~FG,G) ∝ P(G | ~FG) P(~FG|T)P(T)
A similar relationship holds for ~T:
P(~T | ~FG, G) ∝ P(G| ~T, ~FG)P(~FG|~T)P(~T)
Normalizing, we obtain
|
P(T | ~FG, G) = |
P(G| T, ~FG) P(~FG |
T)P(T) |
The "systematic" way to do it is to revert to joint entries (noticing that the subgraph of T, G, and FG is completely connected so no loss of efficiency is entailed). We have
|
P(T | ~FG, G) = |
P(T | ~FG, G) |
= |
P(T, ~FG, G)
|
Now we use the chain rule formula (Equation 15.1 on page 439) to rewrite the joint entries as CPT entries:
|
P(T | ~FG, G) = |
P(T)P(~FG|T)P(G|T,~FG)
|
which of course is the same as the expression arrived at above, Letting P(T) = p, P(FG|T)=g, and P(FG|~T) = h, we get
|
P(T | ~FG, G) = |
p(1
- g)(1 - x) |
f. As with the discussion of backgammon in Chapter 5, this question assumes that expected value decision-making needs no lengthy analysis, being more or less "common sense." The idea here is therefore to calculate the expected cost of shutting down (S) or not shutting down (~S) the reactor when the alarm does or does not sound. (If we have to shut down the reactor even when the alarm does not sound, then our sensing system is no use -an object lesson in the importance of good sensing systems as well as safe reactors!) For the comparison, we will need to know
|
P(T|~A, ~FG, ~FA)= P(T|~G, ~FG)= |
p(1-g)x |
Cost(S) = Cs
|
Cost (~S|A, ~FG, ~FA) =CmP(T|A, ~FG, ~FA)=Cm |
p(1 -g)(1 -x)
|
|
Cost (~S|A, ~FG, ~FA) =CmP(T|~A, ~FG, ~FA)=Cm |
p(1-g)x
|
If the alarm does not sound, shutting down is the best option if
|
Cs<Cm |
p(1-g)x
|
which gives the condition on x as
|
x> |
1 |
|
|
1+(cm/cs -1) |
p(1-g) |
|
If g, h, and p are all « 1, and cs « cm, the condition is approximately x> cs/(cmp). To put this in perspective, suppose the cost of a shutdown, cs, is 105 dollars, the cost of a meltdown, Cm, is 1011 dollars, and the probability of high temperature, p is 10-5 Then the condition is x> 0.1, that is, we would always shut down if the probability of gauge failure is higher than 0.1.
g. The new network is shown in figure(b), The important point here is that even though the two gauges are highly correlated, they are conditionally independent given T and hence are not directly connected.
h. This question is a "common-sense?question that leads into the general idea of information value (Section 16.6). It is a basic property of rational agents that they never do worse with more informafion. Thus, adding a second gauge with the same level of accuracy cannot degrade the overall performance of the system. The "proof" is trivial: suppose that it did. Then the agent would have higher expected utility if it simply ignored the additional information, which means the performance would be the same as before,
3.(30 pts) Solve exercise 15.3 from the Russell and Norvig text.
This is a much easier question than 15.2.
a. Although (i) in some sense depicts the flow of information?during calculation, it is clearly incorrect as a network, since it says tat given the measurements M1 and M2, the number of stars is independent of the focus, (ii) correctly represents the causal stricture: each measurement is influenced by the actual number of stars and the focus, and the two telescopes are independent of each other. (iii) shows a correct but more complicated network-the one obtained by ordering the nodes M1, M2, N, F1, F2. If you order M2 before M1 you would get the same network except with the arrow from M1 toM2 reversed.
b, (ii) requires fewer parameters and is therefore better than (iii).
c.To compute P(M1 | N), we will need to condition on F1 (that is, consider both possible cases for F1, weighted by their probabilifies).
P(M1 | N) = P(M1 |N, F1)P(F1 |N) + P(M1 |N, ~F1)P(~F1 |N) = P(M1 | N,F1)P(F1) +P(M1|N,~F1)P(~F1)
Let f be the probability that the telescope is out of focus. The exercise states that this will cause an “undercount of three or more stars.” For N=3 or less stars, we assume this means the count will be C) if the telescope is out of focus. If it is in focus, then we will assume there is a probability of e of counting one two few, and e of counting one too many. The rest of the time (1 — 2e). the count will be accurate. Then the table is as follows:
|
|
N=1 |
N=2 |
N=3 |
M1=0 |
f+e(1-f) |
f |
f |
M1=1 |
(1-2e)(1-f) |
e(l-f) |
0.0 |
M1=2 |
e(l-f) |
(1-2e)(1-f) |
e(l-f) |
M1=3 |
0.0 |
e(l-f) |
(1-2e)(1-f) |
M1=4 |
0.0 |
0.0 |
e(l-f) |
Notice that each column has to add up to 1. Reasonable values for e andf might be 0.05 and 0.02.
d. This question causes a surprising amount of difficulty, so it is important to make sure students understand the reasoning behind an answer. One approach uses the fact that it is easy to reason in the forward direction, that is, try each possible number of stars N and see whether measurements M1 1 and M2 = 3 are possible. (This is a sort of mental simulation of the physical process.) An alternative approach is to enumerate the possible focus states and deduce the value of N for each. Either way, the solutions are N = 2,4, or>= 6.
For follow-up or alternate questions, it is easy to come up with endless variations on the same theme involving sensors, failure nodes, hidden state. One can also add in complex mechanisms, as for the Starts variable in exercise 15.1.
4.(30 pts) Solve exercise 18.4 from the Russell and Norvig text.
In standard decision trees, attribute tests divide examples according to the attribute value. Therefore any example reaching the second test already has a known value for the attribute and the second test is redundant. In some decision tree systems, however, all tests are Boolean even if the attributes are multivalued or continuous. In this case, additional tests of the attribute can be used to check different values or subdivide the range further (e.g., first check if X > 0, and then if it is, check if x> 10).
5.(30 pts) Solve exercise 18.8 from the Russell and Norvig text.
Suppose that we draw m examples. Each example has a input features plus its classification, so there are 2^(n+1) distinct input/output examples to choose from. For each example, there is exactly one contradictory example, namely the example with the same-input features but the opposite classification. Thus, the probability of finding no contradiction is
|
number of sequences of m non-contndictory
examples |
= |
|
= |
|
For n = 10 with 2048 possible examples, a contradiction becomes likely with probability>0.5 after 54 examples have been drawn.
6**.(30 pts) Solve exercise 18.9 from the Russell and Norvig text.
This result emphasizes the fact that any statistical fluctuations caused by the random sampling process will result in an apparent information gain.
The easy part is showing that the gain is zero when each subset has the same ratio of positive examples. The gain is defined as

Since p = Σpi and n=Σni, if pi/(pi+ni) = pj(pj + nj) for all i,j then we must have pi/(pi + ni) = p/(p + n) for all i, and also ni/(pi + ni) = n/(p + n). From this, we obtain

7.(30 pts) Solve exercise 18.13 from the Russell and Norvig text.
Note: this is the only exercise to cover the material in section 18.6. Although the basic ideas of computational learning theory are both important and elegant, it is not easy to find good exercises that are suitable for an AI class as opposed to a theory class. If you are teaching a graduate class, or an undergraduate class with a strong emphasis on learning, it might be a good idea to use some of the exercises from Kearns and Vazirani (1994).
a. If each test is an arbitrary conjunction of literals, then a decision list can represent an arbitrary DNF (disjunctive normal form) formula directly. The DNF expression C1 V C2 V . . . V Cn, where Ci is a conjunction of literals, can be represented by a decision list in which Ci is the ith test and returns True if successful. That is:
C1 -> True;
C2 -> True;
. . .
Cn -> True;
True —>
False
Since any Boolean function can be written as a DNF formula, then any Boolean function can be represented by a decision list.
b. A decision tree of depth k can be translated into a decision list whose tests have at most k literals simply by encoding each path as a test. The test returns the corresponding leaf value if it succeeds, Since the decision tree has depth k, no path contains more than k Literals,
8.(30 pts) Solve exercise 19.1 from the Russell and Norvig text.
XOR
(in fact any Boolean function) is easiest to construct using step-function
units. Because XOR is not linearly separable, we will need a hidden layer
It turns out that just one hidden node suffices. To design the network,
we can think of the XOR function as OR with the AND case (both inputs on)
ruled out. Thus the hidden layer computes AND, while the output layer computes
OR but weights the output of the hidden node negatively. The network shown
in Figure S 19.1 does the trick. (Figure: A network of step-function
neuwns that computes the XOR function.)
9.(30 pts) Solve exercise 19.3 from the Russell and Norvig text.
This exercise reinforces the student's understanding of neural networks as mathematical functions that can be analyzed at a level of abstraction above their implementation as a network of computing elements. Eor simplicity, we will assume that the activation function is the same linear function at each node: g(x) = cx +d. (The argument is the same (only messier) if we allow different ci and di for each node.)
a. The outputs of the hidden layer are
![]()
The final outputs are
![]()
Now we just have to see that this is linear in the inputs:

Thus we can compute the same function as the two-layer network using
just a one. layer perceptron that has weights
and
an activation function g(x)=![]()
h. The above reduction can be used straightforwardly to reduce an n-layer network to an (n-1)-layer network. By induction, the n-layer network can be reduced to a single-layer network. Thus, linear activation function restrict neural networks to represent only linearly function.
10**.(30 pts) Solve exercise 19.6 from the Russell and Norvig text.
This question is especially important for students who are not expected to implement or use a neural network system. Together with 19.3, it gives you a concrete (if slender) grasp of what the network actually does. Many other similar questions can be devised.
Intuitively, the data suggest that a probabilistic prediction P(Output = 1) : 0.8 is appropriate. The network will adjust its weights to minimize the error function. The error is
![]()
The derivative of the error with respect to the single output O1 is
![]()
Setting the derivative to zero, we find that indeed O1 =0.8, The student should spot the connection to Ex. 18,7, which shows that the network will always report the observed fraction on decision trees.
11.(30 pts) Solve exercise 16.2 from the Russell and Norvig text.
The expected monetary value of the lottery L is
EMV(L) = 1/50 * $10 + 1/2,000,000 * $1,000,000= $0.70
Although $0.70 < $1 it is not necessarily irrational to buy the ticket.
First we will consider just the utilities of the monetary outcomes, ignoring
the utility of actually playing the lottery game. Using U($n) to represent
the utility to the agent of having n dollars more than the current state,
and assuming that utilityis liner for small values of money (i.e., U($n)
n
( U($1) — U($0)) for -10 <= n <= 10). the utility of the
lottery is:
U(L) = 1/50 * U($10)
+ 1/2,000,000 * U($1,000,000)
1/5
* U($1) + 1/2,000,000 * U($1,000,000)
This is more than U($1) when U($1,000,000)> 1,600,000 U($1). Thus, for a purchase to be rational (when only money is considered), the agent must be quite risk-seeking. This would be unusual for low-income individuals, for whom the price of a ticket is non-trivial. It is possible that some buyers do not internalize the magnitude of the very low probability of winning-to imagine an event is to assign it a "non-trivial" probability, in effect. Apparently, these buyers are betterat internalizingthe large magnitude of the prize Such buyers are clearly acting irrationally.
Some people may feel their current situation is intolerable, that is,
U($0) = -
Therefore the situation of having one dollar less would be equally intolerable,
and it would be rational to gamble on a high payoff, even if one that has
low probability.
Gamblers also derive pleasure from the excitement of the lottery and the temporary possession of at least a non-zero chance of wealth. So we should add to the utility of playing the lottery the term t to represent the thrill of participation. Seen this way, the lottery is just another form of entertainment, and buying a lottery ticket is no more irrational than buying a movie ticket. Either way, you pay your money, you get a small thill t, and (most likely) you walk away empty-handed.
12.(30 pts) Solve exercise 16.7 from the Russell and Norvig text.
This exercise can be solved using an influence diagram package such as IDEAL. The specific values are not especially important. Notice how the tedium of encoding all the entries in the utility table cries out for a system that allows the additive, multiplicative, and other forms sanctioned by MAUT.
One of the key aspects of the fully explicit representation in Figure 16.4 is its amenability to change. By doing this exercise as well as 16.8, stndents will augment their appreciation of the flexibility afforded by declarative representations, which can otherwise seem tedious.
If each aircraft generates half as much noise, the CPT for Noise is adjusted accordingly. If the noise attribute becomes three times more important, the utility table entries must all be altered. If an appropriate (e.g.. additive) representation is available, then one would only need to adjust the appropriate constants to reflect the change.
13 (30 pts) Consider the Bayesian network in the following figure and the set of independencies it implies Then answer the questions in the next page
(a) Is A independent of D ( unconditional test)? 
(b) Is A
independent of D given of {C}? )
(c) Is A independent of D given {E}?
(d)
Is A independent of D given {B, C}?
(e) Is E independent of D?
(f)
Is E independent of D given {B}?
(g) Is E independent of D given {B,F}?
(h)
What is the complete set of independencies between any two nodes in
the network? You may use a shorthand notation of ‘*’ to indicate any subset
of the remaining nodes not explicitly mentioned in the current independence
statement.
Also refer to:
Conditional independence in Bayes Nets in A Brief Introduction to Graphical Models and Bayesian Networks By Kevin Murphy:
In general, the conditional independence relationships encoded by a Bayes Net are best be explained by means of the "Bayes Ball" algorithm (due to Ross Shachter), which is as follows: Two (sets of) nodes A and B are conditionally independent (d-separated) given a set C if and only if there is no way for a ball to get from A to B in the graph, where the allowable movements of the ball are shown below. Hidden nodes are nodes whose values are not known, and are depicted as unshaded; observed nodes (the ones we condition on) are shaded. The dotted arcs indicate direction of flow of the ball.http://www.cs.berkeley.edu/~murphyk/Bayes/Figures/bayes_ball_no_det.gif
The most interesting case is the first column, when we have two arrows converging on a node X (so X is a "leaf" with two parents). If X is hidden, its parents are marginally independent, and hence the ball does not pass through (the ball being "turned around" is indicated by the curved arrows); but if X is observed, the parents become dependent, and the ball does pass through, because of the explaining away phenomenon. Notice that, if this graph was undirected, the child would always separate the parents; hence when converting a directed graph to an undirected graph, we must add links between "unmarried" parents who share a common child (i.e., "moralize" the graph) to prevent us reading off incorrect independence statements.
Now consider the second column in which we have two diverging arrows from X (so X is a "root"). If X is hidden, the children are dependent, because they have a hidden common cause, so the ball passes through. If X is observed, its children are rendered conditionally independent, so the ball does not pass through. Finally, consider the case in which we have one incoming and outgoing arrow to X. It is intuitive that the nodes upstream and downstream of X are dependent iff X is hidden, because conditioning on a node breaks the graph at that point.
[Return to Jie Bao's Homepage]