 |
Machine Learning
Department of Computer Science
Iowa State University
|
Study Guide
Week 1 (January 10, 2005)
Overview of the course. Overview of machine learning. Why should machines learn?
Operational definition of learning. Taxonomy of machine learning.
Specification of a computational model of learning. Example: specification
and analysis of a model for conjunctive concept learning. Role of
representational and inferential bias in learning.
Review of probability theory and random variables. Probability spaces. Ontological and epistemological commitments of probabilistic representations of knowledge. Bayesian (subjective view of probability) -- Probabilities as measures of belief conditioned on the agent's knowledge. Possible world interpretation of probability. Axioms of probability. Conditional probability. Bayes theorem. Random Variables. Discrete Random Variables as functions from event spaces to Value sets. Possible world interpretation of random variables. Joint Probability distributions. Conditional Probability Distributions. Conditional Independence of Random variables. Pair-wise independence and independence. Entropy of random variables, Information, Mutual Information. Measures of distance between probability distributions -- Kullback-Liebler divergence.
Required readings
Recommended Readings
Strongly Recommended Java Readings for those unfamiliar with Java.
Week 2 (Beginning January 17, 2005)
Bayesian Decision Theory. Optimal Bayes Classifier. Minimum Risk Bayes Classifier. Naive Bayes Classifier. Maximum Likelihood Probability Estimation. Properties of Maximum Likelihood Estimators. Limitations of Maximum Likelihood Estimators. Bayesian Estimation. Conjugate Priors. Detailed treatment of Bayesian estimation in the multinomial case using Dirichlet priors. Maximum A posteriori Estimation. Representative applications of Naive Bayes classifiers - e.g., text classification.
Required readings
-
Chapter 6 from: Mitchell, T. 1997.
Machine Learning. New York: McGraw Hill.
-
Chapter 3, sections 3.1-3.6 from: Duda, R., Hart, P., and Stork, D. (2001). Pattern Recognition.
-
Chapter 5 (Statistical Concepts) from: Jordan, M. (2003). An Introduction to Probabilistic Graphical Models. Draft.
-
Lecture Slides. Vasant Honavar
-
D. D. Lewis. Naive Bayes at forty: The independence assumption in information retrieval. In ECML-98: Proceedings of the Tenth European Conference on Machine Learning, pages 4--15, Chemnitz, Germany, April 1998. Springer.
-
P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103--130, 1997.
-
McCallum, A. and Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification.. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press. 1998.
-
Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger
Tackling the Poor Assumptions of Naive Bayes Text Classifiers
Proceedings of the Twentieth International Conference on Machine Learning. 2003.
-
Susana Eyheramendy, David Lewis, David Madigan
On the Naive Bayes Model for Text Categorization.
In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics., Bishop, C.M. and Frey, B. (Ed). 2003.
-
Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, School of Computer Science,
Carnegie Mellon University, March 1996.
-
Zhang, J. and Honavar, V. (2004). Extended version of AVT-NBL: An Algorithm for Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data In: Proceedings of the IEEE International Conference on Data Mining, Brighton, UK. To appear in Journal of Knowledge and Information Systems.
Recommended Readings
-
Langley, P., Iba, W., and Thompson, K. (1992). An Analysis of Naive Bayes Classifier. In: Proceedings of AAAI. 1992.
-
Langley, P. and Sage, S. (1999). Tractable average-case analysis of naive Bayesian classifiers.. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 220-228). Bled, Slovenia: Morgan Kaufman.
-
Rish, I. An Empirical Study of Naive Bayes Classifier, In: Proc. ICML 2001.
-
Yang, Y. and G. I. Webb (2003). On Why Discretization Works for Naive-Bayes Classifiers. In Proceedings of the 16th Australian Conference on AI (AI 03)Lecture Notes AI 2903, pages 440-452. Berlin: Springer-Verlag.
-
George H. John, Pat Langley, Estimating Continuous Distributions in Bayesian Classifiers Proceedings of the 1995 Conference on Machine Learning.
Week 3 (Beginning Jan 24, 2005)
More on Bayesian estimation of probabilities from Data.
Learning Decision Tree Classifiers from Data.
Occam's razor. Brief digression into information theory.
Entropy of random variables, Information, Mutual Information. Measures of distance between probability distributions -- Kullback-Liebler divergence.
Bayesian Learning of Classifiers.
Learning Maximum a-posteriori (MAP) and Maximum Likelihood (ML) hypothesis from data. The relationship between MAP hypothesis learning, minimum description length principle (Occam's razor) and the role of priors.
Equivalence of ML hypothesis learner and consistent learner for classification tasks.
Algorithm for learning decision tree classifiers from data.
Evaluation of classifiers. Accuracy, Precision, Recall, Correlation Coefficient, ROC curves.
Required Readings
-
A Primer on Information Theory by Tom Schneider.
-
Lectures 1 and 2 from A Short Course on Information Theory by David MacKay, Cambridge University.
-
Overview of Machine Learning, Nils Nilsson.
- Chapters 3, 5 from Mitchell's Machine Learning Textbook.
- Vasant Honavar, Lecture Slides.
-
Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Invited Paper.
Vol. 1. pp. 80-89.
-
Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. pp. 880-887.
-
Atramentov, A., Leiva, H., and Honavar, V. (2003).
A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments.. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag. Lecture Notes in Computer Science. Vol. 2835, pp. 38-56.
-
Silvescu, A., and Honavar, V. (2001). Temporal Boolean Network Models of Genetic Networks and Their Inference from Gene Expression Time Series. Complex Systems.. Vol. 13. No. 1. pp. 54-.
-
Wang, X., Schroeder, D., Dobbs, D., and Honavar, V. (2003). Automated Data-Driven Discovery of Motif-Based Protein Function Classifiers. Information Sciences. Vol. 155. pp. 1-18. 2003.
-
Wang, H. and Zaniolo, C. (2000).
CMP: A Fast Decision Tree Classifier Using Multi-variate Predictions In: Proceedings of the International Conference on Data Engineering.
-
Domingos, P. (1999). The Role of Occam's Razor in Knowledge Discovery. Preprint.
Recommended Readings
-
Chapters 2, 4 from
Information Theory, Inference, and Learning Algorithms. David MacKay.
-
Chapter 2 from Entropy and Information Theory R.M. Gray.
-
An article on Probability Theory by Tom Loredo
-
A Mathematical Theory of Communication, C. Shannon.
-
Algorithmic Information Theory.. G. Chaitin.
-
Codrington, C. W. and Brodley, C. E., On the Qualitative Behavior of Impurity-Based Splitting Rules: The Minima-Free Property. Tech. Rep. 97-05. Dept. of Computer Science. Cornell University.
-
Brodley, C. and Utgoff, P. (1995). Multi-variate Decision Trees. Machine Learning 19: 45-77.
-
Martin, K.J. (1997). An exact Probability Metric for Decision Tree Splitting and Stopping. Machine Learning 28: 257-291.
- J. E. Gehrke, V. Ganti, R. Ramakrishnan, and W-Y Loh.
BOAT -- Optimistic Decision Tree Construction. In Proceedings of the 1999 SIGMOD Conference, Philadelphia, Pennsylvania, 1999.
-
Johannes E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. Data Mining and Knowledge Discovery, Volume 4, Issue 2/3, July 2000, pp 127-162.
Week 4 (Jan 31, 2004)
Evaluation of classifiers -- estimation of performance measures; confidence interval calculation for estimates; cross-validation based estimates of hypothesis performance; leave-one-out and bootstrap estimates of performance; comparing two hypotheses; hypothesis testing; comparing two learning algorithms.
Algorithms for Learning Decision Trees (Continued).
Overfitting and methods to avoid overfitting -- dealing with small sample sizes; prepruning and post-pruning. Pitfalls of entropy as a splitting criterion for multi-valued splits. Alternative splitting strategies -- two-way versus multi-way splits; Alternative split criteria: Gini impurity, Entropy, etc. Cost-sensitive decision tree induction -- incorporating attribute measurement costs and misclassification costs into decision tree induction.
Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification;
Required readings
- Chapters 3, 5 from Mitchell's Machine Learning Textbook.
-
Fawcett, T. (2003)
ROC Graphs: Notes and Practical Considerations for Researchers HP Labs Tech. Report HPL2003-4.
-
Flach, P. (2004). Tutorial Notes. ICML-2004 Tutorial on ROC Curves. International Conference on Machine Learning, 2004.
-
Overview of Machine Learning, Nils Nilsson.
- Chapter 5 from Mitchell's Machine Learning Textbook.
- Vasant Honavar, Lecture Slides.
-
Wang, H. and Zaniolo, C. (2000).
CMP: A Fast Decision Tree Classifier Using Multi-variate Predictions In: Proceedings of the International Conference on Data Engineering.
-
Dietterich, T., Kong, E. (1995). Machine Learning Bias, Statistical Bias, Statistical Variance of Decision Tree Algorithms. Technical Report, Department of Computer Science, Oregon State University
Recommended Readings
-
Codrington, C. W. and Brodley, C. E., On the Qualitative Behavior of Impurity-Based Splitting Rules: The Minima-Free Property. Tech. Rep. 97-05. Dept. of Computer Science. Cornell University.
-
Brodley, C. and Utgoff, P. (1995). Multi-variate Decision Trees. Machine Learning 19: 45-77.
-
Martin, K.J. (1997). An exact Probability Metric for Decision Tree Splitting and Stopping. Machine Learning 28: 257-291.
-
Manish Mehta Jorma Rissanen Rakesh Agrawal (1995)
MDL-based Decision Tree Pruning.. In: Proceedings of KDD-95.
-
Bradford, C. Kunz, R. Kohavi, C. Brunk, and C.E. Brodley. Pruning decision trees with misclassification costs. In Proceedings of Tenth European Conference on Machine Learning (ECML-98), pages 131-136, Berlin, 1998. 95
-
Elomaa, T. and Rousu, J. (1999). General and Efficient Multisplitting of Numerical Attributes.. Machine Learning. 1999.
-
W.-Y. Loh and Y.-S. Shih, Split selection methods for classification trees, Statistica Sinica, vol. 7, pp. 815--840, 1997.
-
Mansour, Y. and McAllester, D. (2000).
Generalization Bounds for Decision Trees. In: Proceedings of COLT-2000.
-
M. Kearns and Y. Mansour.
A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization. In: Proceedings of the 15th International Conference
on Machine Learning, 1998, Morgan Kaufmann.
Week 5 (Beginning February 7, 2005)
Introduction to Artificial Neural Networks. Threshold logic unit (perceptron) and the associated hypothesis space. Connection with Logic and Geometry. Weight space and pattern space representations of perceptrons. Linear separability and related concepts. Perceptron Learning algorithm and its variants. Convergence properties of perceptron algorithm. Winner-Take-All Networks.
Dual representation of Perceptrons. A learning algorithm using dual representation of perceptrons.
Advantages of working with dual representation. Nonlinear feature space mappings for learning non linear decision boundaries.
Challenges of learning non linear decision boundaries using feature space mappings - computational problem of handling high dimensional
feature spaces and the curse of dimensionality (with implications for generalization).
The Support Vector Machine (SVM) solution - Kernel functions for dealing with the computational problem.
Kernel Matrices. Kernel Functions. Properties of Kernel Matrices and Kernel Functions. How to tell a good kernel from a bad one.
How to construct kernels.
Required Readings
- Lecture Slides. (perceptrons). Vasant Honavar.
- Chapter 4 from Mitchell's Machine Learning Textbook.
- Nils Nilsson. Neural Networks.
-
Honavar, V.
Threshold Logic Units.
-
Honavar, V.
Perceptron Learning Algorithm.
-
Honavar, V.
Multi-category Generalizations of Perceptron Algorithm.
-
A Tutorial on Support Vector Machines. Nello Christianini, International Conference on Machine Learning (ICML 2001).
-
Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. In COLT-97, pages 171--183, 1997.
-
Skolkopf, B. (2000). Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23. Microsoft Research. 2000.
-
Graepel, T., Herbrich, R., & Williamson, R. C. (2001). From margin to sparsity. In Advances in Neural Information System Processing 13.
Recommended Readings
-
Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines. London: Cambridge University Press.
-
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Classification. London: Cambridge University Press.
-
Littlestone, N., and Warmuth, M. (1994).
The weighted Majority Algorithm, Information and Computation Vol. 108: 212-261. 1994.
-
Blum A. (1995).
Empirical Support for Winnow and Weighted Majority Algorithms, In: Proceedings of the
Twelfth International Conference on Machine Learning, pages 64--72. Morgan Kaufmann, 1995.
-
Golding, A. and Roth, D. (1999).
Applying Winnow to Context-Sensitive Spelling Correction
Machine Learning, 34(1-3):107--130, 1999.
-
Yang, J., Parekh, R. and Honavar, V. (1999).
Distal: An Inter-Pattern Distance-Based Constructive Neural Network Algorithm. Intelligent Data Analysis Vol. 3, pp. 55-73. 1999.
-
Nilsson, N. J. Mathematical Foundations of Learning Machines. Palo Alto, CA: Morgan Kaufmann (1992).
-
Minsky, M. amd Papert, S. Perceptrons: Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).
-
McCulloch, W. Embodiments of Mind. Cambridge, MA: MIT Press.
Week 6 (beginning February 14, 2005)
From Kernel Machines to Support Vector Machines.
Maximal Margin Separating Hyperplanes -- Why?
Digression: Vapnik-Chervonenkis (VC) Dimesion and its properties. VC dimension of the hypothesis space of hyperplanes.
Vapnik's bounds on Misclassification rate (error rate). Minimizing misclassification risk by maximizing margin.
Formulation of the problem of finding margin maximizing separating hyperplane as an optimization problem.
Digression: Unconstrained Optimization.
Review of Relevant Mathematics (Limits, Continuity and Differentiablity of Functions, Local Minima and Maxima, Derivatives,
Partial Derivatives, Taylor Series Approximation, Multi-Variate Taylor Series Approximation).
Introduction to Lagrange/Karush-Kuhn-Tucker Optimization Theory. Optimization
problems. Linear, quadratic, and convex optimization problems. Primal and
dual representations of optimization problems.
Convex Quadratic programming formulation of the
maximal margin separating hyperplane finding problem. Characteristics of the
maximal margin separating hyperplane.
Implementation of Support Vector Machines.
Required readings
-
Lecture Slides.. Vasant Honavar
-
A Tutorial on Optimization by Martin Osborne (read chapters 1-7).
-
Skolkopf, B. (2000). Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23. Microsoft Research. 2000.
-
Muller, A.R., Mika. S., Ratsch, G., Tsuda, K., and Skolkopf. B. (2001). An Introduction to Kernel Methods for Pattern Classification. IEEE Transactions on Neural Networks. Vol. 12, pp. 188-201.
-
B. E. Boser, I. M. Guyon, and V. N. Vapnik.
A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on COLT, pp. 144-152, Pittsburgh, PA, 1992. ACM Press.
-
A Tutorial on Support Vector Machines. Nello Christianini, International Conference on Machine Learning (ICML 2001).
-
J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998.
-
M. A. Hearst, B. Schvlkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18-28, 1998.
-
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr, and Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines.
Proc. Natl. Acad. Sci. USA 97: 262-267: 2000.
-
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML-98), 1998.
-
Yan, C., Dobbs, D., and Honavar, V. A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Bioinformatics Vol. 20. pp. i371-i378. 2004.
-
Platt, J. Fast training of support vector machines using sequential minimal optimization.
In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in
Kernel Methods --- Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
-
Leslie, C., Eskin, E., Weston, J., and Noble, W.S. (2002). Mismatch String Kernels for Protein Classification. Neural Information Processing Systems 2002.
-
Edgar Osuna, Robert Freund, and Federico Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, 1997.
Recommended readings
-
S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. Improvements to platt's SMO algorithm for SVM classifier design. Technical report, Dept of CSA, IISc, Bangalore, India, 1999.
-
S.S. Keerthi and E.G. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Technical Report CD-00-01, Dept. of Mechanical and Production Eng., National University of Singapore, 2000.
-
S.S. Keerthi, S. Shevade, C. Bhattacharya and K. Murthy. A fast iterative nearest point algorithm for support vector machine classifier design. (Technical Report TR-ISL-99-03). Department of Compiuter Science and Automation. Indian Instiitute of Science, Bangalore, India, 1999.
-
Yi Li and Philip M. Long, The Relaxed Online Maximum Margin Algorithm, Machine Learning Vol. 46,
pp. 361-, 2002.
-
J. Platt, N. Cristianini, J. Shawe-Taylor, Large Margin DAGs for Multiclass Classification, in: Advances in Neural
Information Processing Systems 12, pp. 547-553, MIT Press, (2000).
Week 7 (Beginning February 21, 2005).
Bayesian Framework for classification revisited. Naive Bayes classifier as a generative model. Relationship between generative models and linear classifiers. Additional examples of generative models. Generative models from the exponential family of distributions. Generative models versus discriminative models for classification. Derivation of gradient-based discriminative models directly for classification.
Required readings
-
Lecture Slides.. Vasant Honavar
-
Chapters 7 and 8 from Jordan, M. (2005). An Introduction to Probabilistic Graphical Models. (Available on reserve at the library).
-
Rubinstein, D and Hastie, T. Discriminative vs Informative Learning. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1997.
-
Mitchell, T. (2005).
Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Draft book chapter.
-
Bouchard, G. and Triggs, B. (2004). The tradeoff between Generative and Discriminative Classifiers, Proceedings of Computational Statistics (Compstat 04).
-
Raina, R., Shen, Y., Ng, A., and McCallum, A. (2003). Classification with Hybrid Generative/Discriminative Models. In Proceedings of the IEEE Conference on Neural Information Systems (NIPS 2003).
Recommended Readings
Week 8 (Beginning February 28 2004)
Bayesian Networks. Conditional Independence Revisited, d-separation, and compact representation of joint probability distribution functions in Bayes Networks.
Reasoning with Bayes Networks, Some algorithms for Exact Inference, and Approximate Inference of relevant probabilities from a Bayesian network using stochastic simulation (Sampling).
Required readings
Recommended Readings
Week 9 (Beginning March 7, 2005)
Learning Bayesian Networks from Data. Learning of parameters (conditional probability tables) from fully specified instances (when no attribute values are missing) in a network of known structure (review).
Learning Bayesian networks with unknown structure -- scoring functions for structure discovery,
searching the space of network topologies using scoring functions to guide the search, structure learning in practice,
Bayesian approach to structure discovery, examples.
Learning Bayesian network parameters in the presence of missing attribute values (using Expectation Maximization)
when the structure is known; Learning networks of unknown structure in the presence of missing attribute values.
Required readings
-
Chapter 6 from: Mitchell, T. 1997.
Machine Learning. New York: McGraw Hill.
-
Lecture Slides. Vasant Honavar
-
A Tutorial on Learning with Bayesian Networks, David Heckerman. Tech. Rep. MSR-TR-95-06. Microsoft Research.
-
Approximating Discrete Probability Distributions with Dependence Trees. Chou, C.K. and Liu, C.N. IEEE Transactions on Information Theory. 14(3), 1968. pp. 462-467.
-
Learning Bayesian belief networks: An approach based on the MDL principle., W. Lam and F. Bacchus, Computational Intelligence, 10(4), 1994.
-
Bayesian Network Classifiers Friedman, N., Geiger, D., and Goldszmidt, M. Machine Learning 29: pp. 131-163. 1997.
-
Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data, Zhang, J. Kang, D-K., Silvescu, A. and Honavar, V. Knowledge and Information Systems. In press. 2005.
-
An Introduction to MCMC for Machine Learning, Andrieu et al., Machine Learning, 2001.
-
Being Bayesian about Network Structure - A Bayesian Approach To Structure Discovery in Bayesian Networks N. Friedman and D. Koller. Machine Learning Vol. 50. pp. 95-125. 2003.
-
Tractable Learning of Large Bayes Net Structures from Sparse Data, Goldernberg, A. and Moore, A. (2004). In Proceedings of the International Conference on Machine Learning, 2004.
-
-
Learning Bayesian Network Structure from Massive Data Sets - The Sparse Candidate Algorithm In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) 1999.
-
Inferring Cellular Networks Using Probabilistic Graphical Models Science Vol. 303, pp. 799-805.
Recommended Readings
-
Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: Basic Properties.J. Suzuki, IEICE Transactions on Fundamentals, vol. E82, No. 10., pp. 2237 2245
-
Comparing Model Selection Criteria for Belief Networks Tim Van Allen, Russ Greiner, 2000.
-
Module Networks: Identifying Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data, Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. Nature Genetics Vol. 34, pp. 166-176. 2003.
-
Learning Bayesian Network Classifiers for Credit Scoring Using Markov Chain Monte Carlo Search, B. Baesens et al., 2001.
-
Operations for Learning with Graphical Models Wray Buntine. Journal of Artificial Intelligence Research. Vol. 2. pp. 159-225. 1994.
-
SPRING BREAK
Week 10 (Beginning March 21 2004)
Topics in Computational Learning Theory
Mistake bound analysis of learning algorithms. Mistake bound analysis of online algorithms for learning Conjunctive Concepts. Optimal Mistake Bounds. Version Space Halving Algorithm. Randomized Halving Algorithm. Learning monotone disjunctions in the presence of irrelevant attributes -- the Winnow and Balanced Winnow Algorithms. Multiplicative Update Algorithms for concept learning and function approximation. Weighted majority algorithm. Applications.
Probably Approximately Correct (PAC) Learning Model. Efficient PAC learnability. Sample Complexity of PAC Learning in terms of cardinality of hypothesis space (for finite hypothesis classes). Some Concept Classes that are easy to learn within the PAC setting.
Required readings
-
Lecture Slides. V. Honavar
-
Chapter 7, Mitchell, T. 1997. Machine Learning. New York: McGraw Hill.
-
The weighted Majority Algorithm, Littlestone, N., and Warmuth, M. Information and Computation Vol. 108: 212-261. 1994.
-
Empirical Support for Winnow and Weighted Majority Algorithms, Blum, A. In: Proceedings of the
Twelfth International Conference on Machine Learning, pages 64--72. Morgan Kaufmann, 1995.
-
Applying Winnow to Context-Sensitive Spelling Correction Golding, A., and Roth, D.
Machine Learning, 34(1-3):107--130, 1999.
-
M.H. Yang, D. Roth, and N. Ahuja. A SNoW-based face detector. NIPS(12), pages 855-861, 2000.
-
-
A Tutorial on Computational Learning Theory. V. Honavar
-
Overview of the Probably Approximately Correct (PAC) Learning Framework. D. Haussler, 1995.
Recommended Readings
-
Computational Learning Theory Sally Goldman.
-
A.J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 171--183, 1997.
-
Tong Zhang. Regularized winnow methods. In Advances in Neural Information Processing Systems 13, pages 703-709, 2001.
-
Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K. On-line portfolio selection using multiplicative updates. Mathematical Finance, vol. 8 (4), pp.325-347, 1998.
Week 11 (Beginning March 28 2005)
Efficiently PAC learnable concept classes. Sufficient conditions for efficient PAC learnability. Some concept classes that are not efficiently learnable in the PAC setting. Making hard-to-learn concept classes efficiently learnable -- transforming instance representation and hypothesis representation. Occam Learning Algorithms. PAC Learnability of infinite concept classes. Vapnik-Chervonenkis (VC) dimension. Properties of VC dimension, VC dimension and learnability, Learning from Noisy examples, Transforming weak learners into PAC learners through accuracy and confidence boosting, Learning under helpful distributions - Kolmogorov Complexity, Conditional Kolmogorov Complexity, Universal distributions, Learning Simple Concepts, Learning from Simple Examples
Required readings
-
Lecture Slides. V. Honavar
-
Chapter 7, Mitchell, T. 1997. Machine Learning. New York: McGraw Hill.
-
Overview of the Probably Approximately Correct (PAC) Learning Framework. D. Haussler, 1995.
-
Kearns, M. 1998. Efficient Noise Tolerant Learning from Statistical Queries. Journal of the ACM. Vol. 45, pp. 983-1006.
-
Parekh, R. and Honavar, V. (2000). On the Relationships between Models of Learning in Helpful Environments. In: Proceedings of the Fifth International Conference on Grammatical Inference. Lisbon, Portugal.
-
Parekh, R. and Honavar, V. (2001). DFA Learning from Simple Examples. Machine Learning. Vol. 44. pp. 9-35.
Recommended Readings
-
Computational Learning Theory Sally Goldman.
-
Cesa-Bianchi, N., Dichterman, E., Fischer, P., Shamir, E., Simon, H. 1999. Sample-Efficient Strategies for Learning in the Presence of Noise. Journal of the ACM. Vol. 46. pp. 684-719.
-
Goldreich, O. and Goldwasser, S. 1998. Property testing and its connection to Learning and approximation. Journal of the ACM. Vol. 45. pp. 653-750.
-
Khardon, R. and Roth, D. 1997. Learning to Reason. Journal of the ACM. Vol, 44. pp. 697-725.
-
Valiant, L. 2000. A Neuroidal Architecture for Cognitive Computation. Journal of the ACM. Vol. 47. pp. 854-882.
-
Maass, W. 1994. Efficient Agnostic PAC Learning With Simple Hypotheses. . Proceedings of the Seventh Annual Conference on Computational Learning Theory. 1994. pp. 67-75.
-
Benedek. G. and Itai, A. Dominating Distributions and Learnability. In: Annual Workshop on Computational Learning Theory. 1992.
Week 12 (Beginning April 4 2005)
Ensemble Classifiers. Techniques for generating base classifiers; techniques for combining classifiers. Committee Machines and Bagging. Boosting. The Adaboost Algorithm. Theoretical performance of Adaboost. Boosting in practice. When does boosting help? Why does boosting work? Boosting and additive models. Loss function analysis. Boosting of multi-class classifiers. Boosting using classifiers that produce confidence estimates for class labels. Boosting and margin. Variants of boosting - generating classifiers by changing instance distribution; generating classifiers by using subsets of features; generating classifiers by changing the output code. Further insights into boosting.
Required readings
-
Lecture Slides. V. Honavar
-
Freund, R. (1999)
A Short Introduction to Boosting Journal of the Japanese Society for Artificial Intelligence, Vol 14, pp. 771-780. (English Translation by Naoki Abe).
-
Friedman, J., Hastie, T., and Tibshirani, R. (2000). A Statistical View of Boosting, Annals of Statistics, Vol. 35, pp. 337-407.
-
Meir, R. and Ratsch, G. (2002).
An Introduction to Boosting and Leveraging. Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, pp. 118-183, Berlin: Springer-Verlag.
-
Baur, E., and Kohavi, R. (1999)
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants Machine Learning. Vol. 36. pp. 105-142.
-
Breiman, L. (1994). Bagging Predictors. Tech. Rep. 421, Department of Statistics, University of California, Berkeley, CA.
Recommended Readings
-
Freund, R. (1996). Game Theory, Online Prediction, and Boosting In: Proceedings of the Conference on Computational Learning Theory (COLT 1996).
-
Schapire, R. and Singer, Y. (1999). Improved Boosting Algorithms Using Confidence-Rated Predictions, Machine Learning Vol. 37.
-
Ratsch, G., and Warmuth, M. (2002). Maximizing the Margin With Boosting, In: Proceedings of the Conference on Computational Learning Theory (2002).
Week 13 (beginning April 11, 2005)
Bayesian Recipe for function approximation and Least Mean Squared (LMS) Error Criterion. Introduction to neural networks as trainable function approximators. Function approximation from examples. Minimization of Error Functions. Review of Relevant Mathematics (Limits, Continuity and Differentiablity of Functions, Local Minima and Maxima, Derivatives, Partial Derivatives, Taylor Series Approximation, Multi-Variate Taylor Series Approximation)
Derivation of a Learning Rule for Minimizing Mean Squared Error Function for a Simple Linear Neuron.
Momentum modification for speeding up learning.
Introduction to neural networks for nonlinear function approximation.
Nonlinear function approximation using multi-layer neural networks.
Universal function approximation theorem. Derivation of the generalized delta rule (GDR) (the backpropagation learning algorithm).
Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation
functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima, using domain-specific constraints (e.g., translation invariance in visual pattern recognition), exploiting hints, using neural networks for function approximation and pattern classification. Relationship between neural networks and Bayesian pattern classification. Variations -- Radial basis function networks. Learning non linear functions by searching the space of network topologies as well as weights.
Required readings
-
Mitchell, T. Chapter 4, Machine Learning.
-
Lecture Slides.. Vasant Honavar
-
Honavar, V. Function Approximation from Examples.
-
Honavar, V. Multi-layer networks.
-
Honavar, V. Radial Basis Function Networks.
-
T. G. Dietterich, H. Hild, and G. Bakiri.
A comparative study of ID3 and backpropagation for English text-to-speech mapping. In Proceedings of 7th IMLW, Austin, 1990.
Morgan Kaufmann.
-
Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R. Howard, W. Hubbard,
and L. Jackel.Handwritten digit recognition with a backpropagation neural network. In D.
Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396--404. Morgan
Kaufmann, San Mateo, CA, 1990.
-
Thrun and T.M. Mitchell. Learning one more thing. Technical Report CMU-CS-94-184, Carnegie Mellon University, Pittsburgh, PA 15213, 1994.
-
Fahlman, S. and Lebiere, C. 1991. Cascade Correlation Architecture Technical Report CMU-CS-90-100 Carnegie Mellon University, August 1991.
-
Poggio et al. 1989. A Theory of Networks for Approximation and Learning Technical Report 1140, MIT Artificial Intelligence Laboratory, 1989.
Recommended Readings
-
Caruana, R.
Learning Many Related Tasks through backpropagation in NIPS, MIT Press, 1995.
-
R. Williams, and D. Zipser. Gradient-Based Learning Algorithms for
Recurrent Networks and
Their Computational Complexity. In Backpropagation: Theory,
Architectures, and Applications, Chauvin and Rumelhart, Eds., LEA, 1995, pp. 433-485.
-
Solomon, R. and J. L. van Hemmen (1996).
Accelerating backpropagation through dynamic self-adaptation.
Neural Networks 9 (4), 589--601.
-
Craven, M. and Shavlik, J.
Using Neural Networks for Data Mining. Future Generation Computer Systems 13:211-229.
-
R. Setiono, W.K. Leow and J.M. Zurada. Extraction of rules from artificial neural networks for nonlinear regression, IEEE Transactions on Neural Networks, 2002.
-
Poggio et al. 1995.
Regularization Theory and Neural Network Architectures Neural Computation, 7:219--269, 1995.
Week 14 (Beginning April 18, 2005)
Lazy Learning Algorithms. Instance based Learning, K-nearest neighbor classifiers, distance functions, locally weighted regression,
sample application to document classification using TFIDF representation. Relative advantages and disadvantages of lazy learning and eager
learning.
Unsupervised or self-supervised learning. Clustering. Learning Mixture Models from Data; Identifiability of Mixture Models; Maximum Likelihood approach to Mixture Model Learning -- Expectation Maximization (EM) algorithms. K-means clustering algorithm and variants. Adaptive Resonance Theory (ART) family of clustering algorithms. Distance measures, Clustering Criteria -- Intra-Cluster and Inter-Cluster distances. Hierarchical Agglomerative Clustering Algorithm. Distributional Clustering, Applications to Learning Attribute Value Taxonomies from Data, Phylogeny Construction. Latent Semantic Indexing, Principal Component Analysis and related methods.
Required readings
-
Lecture Slides. V. Honavar
-
Chapter 8 from: Mitchell, T. 1997.
Machine Learning. New York: McGraw Hill.
-
C. G. Atkeson, S. A. Schaal and Andrew W, Moore, Locally Weighted Learning, AI Review,Volume 11, Pages 11-73 (Kluwer
Publishers) 1997
-
J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Twenty-ninth ACM Symposium on Theory of Computing, 1997.
in IEEE Intelligent Systems Vol. 13, No. 2. pp. 44-49.
-
Pavel Berkhin. A Survey of Clustering Algorithms 2002.
-
Why so many Clustering Algorithms?
-
ART I and Pattern Clustering. In: Proceedings of the 1988 Connectionist Models Summer School, Touretzky, D., Hinton, G., and Sejnowski, T. (Ed). Palo Alto, CA: Morgan Kaufmann.
-
Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words.. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183--190.
-
Pitt, L. and Reinke, R. 1988. Criteria for Polynomial Time (Conceptual) Clustering. Machine Learning. 2:371-396, 1988.
-
Barbara, D. 2002. Requirements for Clustering Data Streams. SIGKDD Explorations. Vol. 3. pp. 23-27.
Recommended Readings
-
P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Fourth ACM-SIAM Symposium on Discrete Algorithms, pages 311-- 321, January 1993.
-
Yang, J. and Honavar, V. (1998).
Feature Subset
Selection Using a Genetic Algorithm. In: Feature Extraction, Construction, and Subset Selection: A Data Mining Perspective. Motoda, H. and Liu, H. (Ed.) New York: Kluwer. 1998. In press. A shorter version of this paper appeared
in IEEE Intelligent Systems Vol. 13, No. 2. pp. 44-49.
-
S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. JACM: Journal of the ACM, 45, 1998.
-
A. Hinneburg, C. Aggarwal, and D. Keim (2000). What is the nearest neighbor in high dimensional spaces?, International Conference on Very Large Data Bases, Cairo, Egypt, 2000, pages 506-515.
-
Escobar, M.D., and West, M. (1995), Bayesian Density Estimation and Inference Using Mixtures Journal of the American Statistical Association, 90, 577-588.
-
Mishra, N., Oblinger, D., and Pitt, L. Sublinear Time Approximate Clustering.. 12th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 439-447, January, 2001.
-
Ramgopal R. Mettu and C. Greg Plaxton. Optimal Time Bounds for Approximate Clustering. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, August 2002, pages 344-351.
-
P. Indyk, 1999. Sublinear Time Algorithms for Metric Space Problems, Symposium on Theory of Computing (STOC '99).
-
N. Slonim and N. Tishby. Agglomerative information bottleneck.. In Proc. of NIPS-12, pages 617-623. MIT Press, 2000.
-
Slonim, N. and Tishby, N. (2000) Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd International Conference on Research and Development in Information Retrieval (SIGIR), pp. 208--215.
-
T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50--57.
Week 15 (April 25, 2004)
Reinforcement Learning. Agents that learn by exploration of
environments, using environmental reward and punishment. Examples of
reinforcement learning problems. Credit assignment problem and its implications. Exploration-exploitation dilemma.
Some approaches to exploration-exploitation tradeoffs. Markov decision processes. Learning optimal policies from
interaction with the environment -- determistic, stochastic, stationary and non-stationary
environments. Value functions and Action-Value functions. Bellman equations and dynamic programming
approach to learning optimal policies when the transition function and reward functions are known (i.e.,
the agent has a model of the environment). Q learning algorithm for learning optimal policies when an
accurate model of the environment is not known. Temporal difference methods. Scaling up reinforcement learning to large
state spaces -- function approximation methods for compact representation of actio-value functions; state abstraction methods
for hierarchical reinforcement learning. Multi-agent reinforcement learning. Applications.
Review, Summary of the Course, and Discussion of Some Current Research Problems
Required readings
- Chapter 13 from the Mitchell Text.
-
Lecture Slides. V. Honavar
-
Harmon, M. and Harmon, S. (1997). Reinforcement
Learning: A Tutorial.
- Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement
Learning: A Survey.
-
Singh, S., Littman, M., Jong, N.K., Pardo, D., and Stone, P. (2003). Learning predictive state representations. In: Proceedings of the International Conference on Machine Learning (ICML 2003).
-
J. Rennie and A. McCallum. Using Reinforcement Learning to Spider the Web Efficiently. Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
-
Singh, S., Barto, A., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In: Advances in Neural Information Processing Systems (NIPS 2005).
-
Barto, A., and Mahadevan, S. Hierarchical Reinforcement Learning. Draft.
-
Schultz, W, Dayan, P and Montague, PR (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.
Recommended readings
- Sutton, R. and Barto, A. (1998). Reinforcement
Learning: An Introduction Cambridge, MA: MIT Press.
-
Mahadevan, S. Spatio-Temporal Abstraction of Stochastic Sequential Processes.
-
M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. . In Proc. of the 15th Int. Conf. on Machine Learning, pages 260--268. Morgan Kaufmann, 1998.
-
P. Marbach, O. Mihatsch, and J. N. Tsisiklis, Call admission control and routing in integrated services networks using neuro-dynamic programming,. IEEE Trans. Veh. Technol., vol. 18, no. 2, pp. 1972.
-
T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227-303, 2000.
-
J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning. Technical report, Australian National University, Research School of Information Sciences and Engineering, July 1999.
-
J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning. Technical report, Australian National University, Research School of Information Sciences and Engineering, July 1999.
-
Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and Systems Sciences, 2002.
-
Mark Humphrys. Action Selection methods using Reinforcement Learning. PhD thesis, University of Cambridge, June 1997.
-
Even-Dar, E. and Mansour, Y. (2001). Convergence of Optimistic and Incremental Q- Learning. NIPS 2001.
-
D. Precup, R. S. Sutton, and S. Singh. Theoretical results on reinforcement learning with temporally abstract options.. In Proceedings of the 10th European Conference on Machine Learning, ECML-98, pages 382--393. Springer Verlag, 1998.
Additional Information
-
AAAI Machine Learning Topics Page
-
Jaynes, E.T. Probability Theory: The Logic of Science, Cambridge University Press, 2003.
-
Cox, R.T. The Algebra of Probable Inference, The Johns Hopkins Press, 1961.
-
Boole, G. The Laws of Thought, (First published: 1854). Prometheus Books, 2003.
-
Feller, W. An Introduction to Probability Theory and its Applications. Vols 1, 2. New York: Wiley. 1968.
-
Russell, S. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach. Prentice Hall.
-
Duda, R., Hart, R., And Stork, D. (2000). Pattern Recognition. Wiley.
-
Langley, P. (1995). Elements of Machine Learning. Palo Alto, CA: Morgan Kaufmann.
-
Bishop, C. M. Neural Networks for Pattern Recognition. New York: Oxford University Press (1995).
-
Baldi, P. and Brunak, S. (2003). Bioinformatics - A Machine Learning Approach. Cambridge, MA: MIT Press.
- Cohen, P.R.Empirical Methods in Artificial Intelligence
-
Chakrabarti, S. (2003). Mining the Web, Morgan Kaufmann.
-
Baldi, P., Frasconi, P., Smyth, P. (2003). Modeling the Internet and the Web - Probabilistic Methods and Algorithms. New York: Wiley.
-
Quinlan, R. (1993). C4.5: Programs for Machine Learning. Palo Alto, CA: Morgan Kaufmann.
-
Sestito, S. and Dillon, T. (1994). Automated Knowledge Acquisition.
Sydney, Australia: Prentice-Hall.
-
Gallant, S. Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press. 1993.
-
Theodoridis, S., and Konstantinos, K. Pattern Recognition. Elsevier. 2003.
-
Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1990.
-
Ripley, B. (1995). Pattern Recognition and Neural Networks. Cambridge University Press.
-
Fukunaga, K. Introduction to Statistical Pattern Recognition. New York: Academic Press (1990).
-
Jensen, F. Bayesian Networks and Decision Graphs. Berlin: Springer-Verlag. 2001.
-
Hastie, T., Tibshirani, R., Friedman, J. (2001).
The Elements of Statistical Learning : Data Mining, Inference, and Prediction.
-
Casella and Berger (2001).
Statistical Inference, New York: Duxbury Press.
-
Kearns, M. J. & Vazirani, U. V. An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press. (1994).
-
Natarajan, B. Machine Learning: A Theoretical Approach. Morgan Kaufmann. 1992.
-
Boyd, S. and Vanderberghe, L. (2004). Convex Optimization Cambridge University Press.
-
Machine Learning Resources by David Aha.
-
WEKA Machine Learning Algorithms in Java
If you would like to be alerted by email when this page is updated, please register with Changedetect.
Copyright © 1999-2005, Vasant Honavar, Department of Computer Science, Iowa State University. All rights reserved.
Dr. Vasant Honavar
Professor
Department of Computer Science
Iowa State University
Atanasoff Hall, Ames, IA 50011-1040 USA
phone: +1-515-294-4377, fax: +1-515-294-0258