Iowa State University

Dae-Ki Kang
Artificial Intelligence Research Laboratory

Department of Computer Science

Research

Current Work:

The major goal of my research is to develop and evaluate computational methods for generating accurate and simple classifiers from various representations of data. For this goal, I have investigated three different methods:
  1. Abstraction: To learn and evaluate taxonomies that can be used for constructing abstract features and simple hypotheses
  2. Aggregation: To generate effective features for accurate classifiers by aggregation based on a bag of values
  3. Recursion: To construct a set of classifiers by iterative application of simple learning algorithms on smaller subsets of the problem
There are many different representations of data in the real world. However, most machine learning algorithms assume that each instance in the data is represented as a tuple of attribute values stored in one flat table. In this research, I have been interested in the following common data representations:
  1. One flat table: Each instance is represented as a tuple of nominal attribute values, and the instances together comprise one flat table
  2. Text and Sequences: Each instance is represented as a variable-length sequence of values from a finite dictionary
  3. Multi-Relational Data: The structure of, and relations among, multiple data tables are represented by a schema graph
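As a minimal sketch of how aggregation over a bag of values can bridge these representations, the Python snippet below maps a variable-length sequence over a finite dictionary to a fixed-length count vector that a flat-table learner could consume. The function name, the dictionary, and the sample sequence are hypothetical and chosen only for illustration; they are not taken from the original work.

```python
from collections import Counter


def bag_of_values(sequence, dictionary):
    """Map a variable-length sequence to a fixed-length count vector.

    The order of `dictionary` fixes the order of the output columns,
    so every sequence yields a tuple of the same length.
    """
    counts = Counter(sequence)
    return [counts[value] for value in dictionary]


# Hypothetical finite dictionary and one variable-length instance.
dictionary = ["open", "read", "write", "close"]
trace = ["open", "read", "read", "write", "close"]

vector = bag_of_values(trace, dictionary)
print(vector)
```

The ordering of values within the sequence is discarded by design; only the frequency of each dictionary entry is kept.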
Conceptually, my research can be recapitulated as follows:

My PhD research is organized around the following four topics.

  1. Learning Taxonomy from Data: I have introduced AVT-Learner, an algorithm for automated construction of attribute value taxonomies from data. The algorithm uses hierarchical agglomerative clustering (HAC) to cluster attribute values based on the distribution of classes that co-occur with them. I have performed experiments on various benchmark data sets, including UCI data sets, text data sets, and protein sequences. The experimental results indicate that the proposed algorithm can yield classifiers that are more compact and often more accurate than those produced by standard machine learning algorithms. The results also show that the taxonomies generated by AVT-Learner are competitive with taxonomies made by human experts (in cases where such taxonomies are available).
  2. Intrusion Detection Using a Bag of System Calls: I have proposed a bag of system calls representation for intrusion detection in system call sequences and reported misuse and anomaly detection results obtained with standard machine learning techniques on two benchmark data sets using this representation. With this feature representation as input, I have compared the performance of several machine learning techniques for misuse detection and presented experimental results on anomaly detection. The results show that standard machine learning and clustering techniques applied to the simple bag of system calls representation are effective and often perform better than approaches that use foreign contiguous subsequences to detect intrusive behaviors of compromised processes.
  3. Recursive Naive Bayes Learner: I have designed RNBL-MN, a recursive Naive Bayes learner for sequence classification. RNBL-MN relaxes the assumption that the instances in each class can be described by a single generative model by constructing a tree of Naive Bayes (NB) classifiers, where each individual NB classifier in the tree is based on a multinomial generative model (one for each class at each node in the tree). Contrary to previous reports by Langley on a recursive NB classifier (RBC) for data sets whose instances are represented as tuples of nominal attribute values, we observe that RNBL-MN substantially outperforms the NB classifier on protein sequence and text classification tasks. Furthermore, our experiments show that RNBL-MN outperforms the C4.5 decision tree learner (using tests on sequence composition statistics as the splitting criterion) and yields accuracies comparable to those of a support vector machine (SVM) using similar information.
  4. Learning Classifiers from Multi-Relational Data: In the relational model of data, the structure and relations of data are represented by a graph of data tables. However, most machine learning algorithms assume that each instance is represented as a tuple of attribute values stored in one table. This assumption introduces bias into feature selection and construction, because a naive join of multiple tables duplicates attributes and class labels. To address this problem, I have devised the multi-relational decision tree learner with graph search (MRDTL-GS), an enhanced version of MRDTL. MRDTL-GS uses target-dependent aggregation with graph traversal techniques over the relational schema.
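
The taxonomy-learning idea in topic 1 (clustering attribute values by the class distributions that co-occur with them) can be illustrated with a small Python sketch. This is not the AVT-Learner implementation: the text above does not say which distance between class distributions is used, so Jensen-Shannon divergence is an assumption here, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalize(counter, classes):
    """Turn class counts into a probability vector over `classes`."""
    total = sum(counter.values())
    return [counter[c] / total for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2); an assumed distance choice."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def build_taxonomy(pairs):
    """Agglomeratively merge attribute values with similar class distributions.

    `pairs` is a list of (attribute_value, class_label) observations.
    Returns the merges in order; read bottom-up they form a binary taxonomy.
    """
    classes = sorted({label for _, label in pairs})
    counts = defaultdict(Counter)
    for value, label in pairs:
        counts[value][label] += 1

    # Each cluster is named by the sorted tuple of values it contains.
    clusters = {(value,): c for value, c in counts.items()}
    merges = []
    while len(clusters) > 1:
        names = sorted(clusters)
        candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        a, b = min(
            candidates,
            key=lambda ab: js_divergence(
                normalize(clusters[ab[0]], classes),
                normalize(clusters[ab[1]], classes),
            ),
        )
        # Merge the closest pair and pool their class counts.
        clusters[tuple(sorted(a + b))] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges


# Toy data: two "pos"-leaning colors and two "neg"-leaning colors.
pairs = (
    [("red", "pos")] * 3
    + [("crimson", "pos")] * 3
    + [("blue", "neg")] * 3
    + [("navy", "neg")] * 2
    + [("navy", "pos")]
)
merges = build_taxonomy(pairs)
for left, right in merges:
    print(left, "+", right)
```

On this toy data, values whose class distributions agree are grouped first, so the resulting hierarchy places "red" with "crimson" and "blue" with "navy" before joining the two groups at the root.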