![]() |
Artificial Intelligence Research Laboratory Department of Computer Science Iowa State University |
Automata Induction, Grammar Inference, and Language Acquisition
Personnel
Project Summary
Funding
Publications
Additional Information
Projects
AI Lab
Grammatical Inference, variously refered to as automata induction, grammar induction, and automatic language acquisition, refers to the process of learning of grammars and languages from data. Machine learning of grammars finds a variety of applications in syntactic pattern recognition, adaptive intelligent agents, diagnosis, computational biology, systems modelling, prediction, natural language acquisition, data mining and knowledge discovery.
Regular grammars are the simplest class of formal grammars in the Chomsky hierarrchy. An understanding of the issues and problems encountered in learning regular languages (or equivalently, identification of the corresponding DFA) are therefore likely to provide insights into the problem of learning more general classes of languages. Consequently, regular grammar inference from examples has received a great deal of attention in theoretical computer science, machine learning, syntactic pattern recognition, computational learning theory, and grammar inference communities.
Exact learning of the target DFA from an arbitrary presentation of labeled examples is a hard problem. Gold showed that the problem of identifying the minimum state DFA consistent with a presentation S comprising of a finite non-empty set of positive examples S+ and possibly a finite non-empty set of negative examples S- is NP-hard. Under the standard complexity theoretic assumption P not equal to NP, Pitt and Warmuth showed that no polynomial time algorithm can be guaranteed to produce a DFA that has approximately the same number of states as the target DFA from a set of labeled examples corresponding to a DFA. In the face of these negative results, efficient learning algorithms for identification of DFA assume that additional information is provided to the learner. Trakhtenbrot and Barzdin described a polynomial time algorithm for constructing the smallest DFA consistent with a complete labeled sample i.e., a sample that includes all strings up to a particular length and the corresponding label that states whether the string is accepted by the target DFA or not. The regular positive and negative inference (RPNI) algorithm identifies in time that is polynomial in the sum of lengths of the examples, a DFA that is consistent with the given sample S. Further, as shown by Oncina and Garcia, if S is a superset of a characteristic set for the target DFA (which roughly means that each state transition is sampled at least once in some sample) then the DFA output by the RPNI algorithm is guaranteed to be equivalent to the target.
When examples are drawn at random (as in the PAC setting), results proved by Kearns and Valiant suggest that an efficient algorithm for learning DFA would entail efficient algorithms for solving problems such as breaking the RSA cryptosystem, factoring Blum integers, and detecting quadratic residues, all of which are known to be hard under standard cryptographic assumptions. Using a variant of Trakhtenbrot and Barzdin's algorithm, Lang empirically demonstrated that random DFAs are approximately learnable from a sparse uniform sample. However, exact identification of the target DFA was not possible even in the average case with a randomly drawn training sample. Several efforts have been made to study the learnability of concept classes under restricted classes of distributions. Li and Vitanyi proposed a model for PAC learning with simple examples called the simple PAC model wherein the class of distributions is restricted to simple distributions (wherein examples are drawn according to the Universal distribution where strings with low Kolmogorov complexity are sampled with a high probability).
Our work has shown (among other things) that the class of DFA that have low Kolmogorov complexity are efficiently learnable if samples are drawn according to the Universal distribution. Some directions that are currently being explored include: efficient algorithms for learning interesting subclasses of grammars; application of grammar inference to characterizing macromolecular sequence families (DNA, RNA, and protein sequences), and incremental grammar induction.
© Vasant Honavar, 1999.