" />

Iowa State University

Iowa State UniversityIowa State University
Machine Learning: Weekly Study Guide

Department of Computer Science

Weekly Study Guide - Spring 2007


Week 1 (January 8, 2007)


Overview of the course. Overview of machine learning. Why should machines learn? Operational definition of learning. Taxonomy of machine learning. Specification of a computational model of learning. Example: specification and analysis of a model for conjunctive concept learning. Role of representational and inferential bias in learning.

Review of probability theory and random variables. Probability spaces. Ontological and epistemological commitments of probabilistic representations of knowledge. Bayesian (subjective view of probability) -- Probabilities as measures of belief conditioned on the agent's knowledge. Possible world interpretation of probability. Axioms of probability. Conditional probability. Bayes theorem. Random Variables. Discrete Random Variables as functions from event spaces to Value sets. Possible world interpretation of random variables.

Required readings

Recommended Readings

Strongly Recommended Java Readings for those unfamiliar with Java.

Additional Information

  • AAAI Machine Learning Topics Page
  • Jaynes, E.T. Probability Theory: The Logic of Science, Cambridge University Press, 2003.
  • Cox, R.T. The Algebra of Probable Inference, The Johns Hopkins Press, 1961.
  • Boole, G. The Laws of Thought, (First published: 1854). Prometheus Books, 2003.
  • Feller, W. An Introduction to Probability Theory and its Applications. Vols 1, 2. New York: Wiley. 1968.
  • Russell, S. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach. Prentice Hall.


Week 2 (Beginning January 15, 2007)

Review of probability theory, random variables, and related topics (continued). Joint Probability distributions. Conditional Probability Distributions. Conditional Independence of Random variables. Pair-wise independence and independence.

Bayesian Decision Theory. Optimal Bayes Classifier. Minimum Risk Bayes Classifier.

Required readings

  • Chapter 6 from: Mitchell, T. 1997. Machine Learning. New York: McGraw Hill.
  • Chapter 1, Section 1.5 of C. Bishop (2006). Pattern Recognition and Machine Learning OR Chapter 3, sections 3.1-3.6 from: Duda, R., Hart, P., and Stork, D. (2001). Pattern Recognition.
  • Chapter 13, Russell and Norvig (2004). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.
  • Lecture Slides. Vasant Honavar
Recommended Readings
  • Chapter 5 (Statistical Concepts) from: Jordan, M. (2003). An Introduction to Probabilistic Graphical Models. Draft.
Additional Information
  • AAAI Machine Learning Topics Page
  • Cox, R.T. The Algebra of Probable Inference, The Johns Hopkins Press, 1961.
  • Boole, G. The Laws of Thought, (First published: 1854). Prometheus Books, 2003.
  • Jeffrey, R. C. The Logic of Decision, Chicago: University of Chicago Press.
  • Duda, R., Hart, P., and Stork, D. (2001). Pattern Recognition.
  • Russell and Norvig (2004). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.


    Week 3 (Beginning January 22, 2007)

    Introduction to Generative Models. Naive Bayes Classifier. Maximum Likelihood Probability Estimation. Properties of Maximum Likelihood Estimators. Limitations of Maximum Likelihood Estimators. Bayesian Estimation. Conjugate Priors. Detailed treatment of Bayesian estimation in the multinomial case using Dirichlet priors. Maximum A posteriori Estimation. Representative applications of Naive Bayes classifiers.

    Required readings

    Recommended Readings

    Additional Information


    Week 4 (Beginning Jan 29 2007)

    Modeling dependence between attributes. The decision tree classifier. Introduction to information theory. Information, entropy, mutual information, and related concepts (Kullback-Liebler divergence).

    Learning hypothesis from data revisited. Learning Maximum a-posteriori (MAP) and Maximum Likelihood (ML) hypothesis from data. The relationship between MAP hypothesis learning, minimum description length principle (Occam's razor) and the role of priors. Equivalence of ML hypothesis learner and consistent learner for classification tasks. Algorithm for learning decision tree classifiers from data.

    Overfitting and methods to avoid overfitting -- dealing with small sample sizes; prepruning and post-pruning. Pitfalls of entropy as a splitting criterion for multi-valued splits. Alternative splitting strategies -- two-way versus multi-way splits; Alternative split criteria: Gini impurity, Entropy, etc. Cost-sensitive decision tree induction -- incorporating attribute measurement costs and misclassification costs into decision tree induction. Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.

    Required Readings

    Recommended Readings

    Additional Information


    Week 5 (Beginning February 5, 2007)

    Evaluation of classifiers. Accuracy, Precision, Recall, Correlation Coefficient, ROC curves.

    Evaluation of classifiers -- estimation of performance measures; confidence interval calculation for estimates; cross-validation based estimates of hypothesis performance; leave-one-out and bootstrap estimates of performance; comparing two hypotheses; hypothesis testing; comparing two learning algorithms.

    Introduction to Artificial Neural Networks and Linear Discriminant Functions. Threshold logic unit (perceptron) and the associated hypothesis space. Connection with Logic and Geometry. Weight space and pattern space representations of perceptrons. Linear separability and related concepts. Perceptron Learning algorithm and its variants. Convergence properties of perceptron algorithm. Winner-Take-All Networks.

    Required Readings

    Recommended Readings

    Additional Information

    • Nilsson, N. J. Mathematical Foundations of Learning Machines. Palo Alto, CA: Morgan Kaufmann (1992).
    • Minsky, M. amd Papert, S. Perceptrons: Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).
    • McCulloch, W. Embodiments of Mind. Cambridge, MA: MIT Press.


    Week 6 (Beginning February 12 2007)

    Introduction to Artificial Neural Networks and Linear Discriminant Functions (Continued). Multi-class classifiers and Winner-Take-All Networks.

    Dual representation of Perceptrons. A learning algorithm using dual representation of perceptrons.

    Linear discriminants - classification via regression, Fisher Linear discriminant functions.

    Nonlinear feature space mappings for learning non linear decision boundaries. Challenges of learning non linear decision boundaries using feature space mappings - computational problem of handling high dimensional feature spaces and the curse of dimensionality (with implications for generalization) (to be revisited).

    Generative versus Discriminative Models for Classification. Bayesian Framework for classification revisited. Naive Bayes classifier as a generative model. Relationship between generative models and linear classifiers. Additional examples of generative models. Generative models from the exponential family of distributions. Generative models versus discriminative models for classification.

    Required Readings

    Recommended Readings

    Additional Information


    Week 7 (beginning February 19, 2007)

    Summary of comparison of generative versus discriminative models. Derivation of learning algorithms for discriminative models based on the exponential family for classification.

    Review of Basics of Optimization - Maximization and Minimization of Functions. Review of Relevant Mathematics (Limits, Continuity and Differentiablity of Functions, Local Minima and Maxima, Derivatives, Partial Derivatives, Taylor Series Approximation, Multi-Variate Taylor Series Approximation).

    Derivation of gradient-based learning algorithms for discriminative models for classification, e.g., gradient-based algorithm for logistic regression, avoiding overfitting - regularized logistic regression.

    Maximum Margin Classifiers. Perceptron Classifier revisited. Challenges of learning non linear decision boundaries using feature space mappings - computational problem of handling high dimensional feature spaces and ensuring generalization (avoiding curse of dimensionality) in high dimensional feature spaces.

    Required readings

    Recommended readings

    Additional Information


    Week 8 (Feb 26, 2007)

    Maximum Margin Classifiers. The Support Vector Machine (SVM) solution - Kernel functions for dealing with the computational problem. Kernel Matrices. Kernel Functions. Properties of Kernel Matrices and Kernel Functions. How to tell a good kernel from a bad one. How to construct kernels.

    From Kernel Machines to Support Vector Machines. Maximal Margin Separating Hyperplanes -- Why?

    Digression: Vapnik-Chervonenkis (VC) Dimesion and its properties. VC dimension of the hypothesis space of hyperplanes.

    Vapnik's bounds on Misclassification rate (error rate). Minimizing misclassification risk by maximizing margin. Formulation of the problem of finding margin maximizing separating hyperplane as an optimization problem.

    Introduction to Lagrange/Karush-Kuhn-Tucker Optimization Theory. Optimization problems. Linear, quadratic, and convex optimization problems. Primal and dual representations of optimization problems. Convex Quadratic programming formulation of the maximal margin separating hyperplane finding problem. Characteristics of the maximal margin separating hyperplane. Implementation of Support Vector Machines.

    Required Readings

    Recommended Readings

    Additional Information


    Week 9 (Beginning March 5 2007)

    Topics in Computational Learning Theory

    Mistake bound analysis of learning algorithms. Mistake bound analysis of online algorithms for learning Conjunctive Concepts. Optimal Mistake Bounds. Version Space Halving Algorithm. Randomized Halving Algorithm. Learning monotone disjunctions in the presence of irrelevant attributes -- the Winnow and Balanced Winnow Algorithms. Multiplicative Update Algorithms for concept learning and function approximation. Weighted majority algorithm. Applications.

    Required readings

    Recommended Readings

    Additional Information

    • Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory, MIT Press.


    Week 10 (Beginning March 12, 2007) Spring Break


    Week 11 (Beginning March 19, 2007)

    Probably Approximately Correct (PAC) Learning Model. Efficient PAC learnability. Sample Complexity of PAC Learning in terms of cardinality of hypothesis space (for finite hypothesis classes). Some Concept Classes that are easy to learn within the PAC setting.

    Efficiently PAC learnable concept classes. Sufficient conditions for efficient PAC learnability. Some concept classes that are not efficiently learnable in the PAC setting. Making hard-to-learn concept classes efficiently learnable -- transforming instance representation and hypothesis representation. Occam Learning Algorithms. PAC Learnability of infinite concept classes. Vapnik-Chervonenkis (VC) dimension. Properties of VC dimension, VC dimension and learnability, Learning from Noisy examples, Transforming weak learners into PAC learners through accuracy and confidence boosting, Learning under helpful distributions - Kolmogorov Complexity, Conditional Kolmogorov Complexity, Universal distributions, Learning Simple Concepts, Learning from Simple Examples

    Required readings

    Recommended Readings

    Additional Information


    Week 12 (Beginning March 26, 2007)

    Ensemble Classifiers. Techniques for generating base classifiers; techniques for combining classifiers. Committee Machines and Bagging. Boosting. The Adaboost Algorithm. Theoretical performance of Adaboost. Boosting in practice. When does boosting help? Why does boosting work? Boosting and additive models. Loss function analysis. Boosting of multi-class classifiers. Boosting using classifiers that produce confidence estimates for class labels. Boosting and margin. Variants of boosting - generating classifiers by changing instance distribution; generating classifiers by using subsets of features; generating classifiers by changing the output code. Further insights into boosting.

    Required readings

    Recommended Readings


    Week 13 (Beginning April 2, 2007)

    Probabilistic Graphical Models. Review of Graph Theory. Bayesian Networks.

    Independence and Conditional Independence. Exploiting independence relations for compact representation of probability distributions. Introduction to Bayesian Networks. Semantics of Bayesian Networks. D-separation. D-separation examples. Answering Independence Queries Using D-Separation tests. Probabilistic Inference Using Bayesian Networks. Types of Bayesian Network Inference. Diagnostic Inference, Causal Inference, Inter-causal Inference, and Mixed Inference.

    Baysian Networks (continued). Exact Inference Algorithms - Variable Elimination Algorithm; Message Passing Algorithm; Junction Tree Algorithm. Complexity of Exact Bayesian Network Inference. Approximate inference using stochastic simulation (sampling, rejection sampling, and liklihood weighted sampling

    Required readings

    Recommended Readings Additional Information
    • Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Judea Pearl (1997).
    • Castillo, E., Gutierrez, J. M., Hadi, A. S. Expert Systems and Probabilistic Network Models. Berlin: Springer (1996).
    • Cowell, R. G. Lauritzen, S. L., and Spiegelhalter, D. J. Probabilistic Networks and Expert Systems Berlin: Springer (2005).
    • Korb, K.B., and Nicholson, A.E., Bayesian Artificial Intelligence, Chapman and Hall (2004).


    Week 14 (Beginning April 9 2007)

    Learning Bayesian Networks from Data. Learning of parameters (conditional probability tables) from fully specified instances (when no attribute values are missing) in a network of known structure (review).

    Learning Bayesian networks with unknown structure -- scoring functions for structure discovery, searching the space of network topologies using scoring functions to guide the search, structure learning in practice, Bayesian approach to structure discovery, examples.

    Learning Bayesian network parameters in the presence of missing attribute values (using Expectation Maximization) when the structure is known; Learning networks of unknown structure in the presence of missing attribute values.

    Required readings

    Recommended Readings


    Week 15 (beginning April 16, 2007)

    Bayesian Recipe for function approximation and Least Mean Squared (LMS) Error Criterion. Introduction to neural networks as trainable function approximators. Function approximation from examples. Minimization of Error Functions. Derivation of a Learning Rule for Minimizing Mean Squared Error Function for a Simple Linear Neuron. Momentum modification for speeding up learning. Introduction to neural networks for nonlinear function approximation. Nonlinear function approximation using multi-layer neural networks. Universal function approximation theorem. Derivation of the generalized delta rule (GDR) (the backpropagation learning algorithm).

    Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima, using domain-specific constraints (e.g., translation invariance in visual pattern recognition), exploiting hints, using neural networks for function approximation and pattern classification. Relationship between neural networks and Bayesian pattern classification. Variations -- Radial basis function networks. Learning non linear functions by searching the space of network topologies as well as weights.

    Lazy Learning Algorithms. Instance based Learning, K-nearest neighbor classifiers, distance functions, locally weighted regression. Relative advantages and disadvantages of lazy learning and eager learning.

    Required readings

    Recommended Readings


    Week 16 (Beginning April 23, 2007)

    Clustering. Learning Mixture Models from Data; Identifiability of Mixture Models; Maximum Likelihood approach to Mixture Model Learning -- Expectation Maximization (EM) algorithms. K-means clustering algorithm and variants.

    Distance measures, Clustering Criteria -- Intra-Cluster and Inter-Cluster distances. Hierarchical Agglomerative Clustering Algorithm. Distributional Clustering, Applications to Learning Attribute Value Taxonomies from Data, Phylogeny Construction. Divisive Clustering - Spectral Clustering Algorithms. Latent Semantic Indexing.

    Required readings

    Recommended Readings