Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
The Intelligent Data Understanding System (INDUS) Project
Algorithms and Software for Knowledge Acquisition from Heterogeneous, Distributed, Autonomous Information Sources
Personnel
- Dr. Vasant Honavar, Professor of Computer Science and of Bioinformatics and Computational Biology, Principal Investigator.
- Dr. Drena Dobbs, Associate Professor of Molecular, Cell, and Developmental Biology, Co-Principal Investigator.
- Doina Caragea, Ph.D. Student, Computer Science. Focus: Algorithms for learning classifiers from heterogeneous data, efficient extraction of sufficient statistics from heterogeneous data, theoretical framework for knowledge acquisition from heterogeneous, distributed, autonomous data.
- Jaime Reinoso-Castillo, Ph.D. Student. Focus: Ontology-based query-centric approaches to information extraction and integration from heterogeneous data, information integration in peer-to-peer networks, algorithms for knowledge acquisition from relational data.
- Jun Zhang, Ph.D. Student, Computer Science. Focus: Learning classifiers from attribute value taxonomies, class taxonomies, and partially specified data.
- Jyotishman Pathak, Ph.D. Student, Computer Science. Focus: Information integration for knowledge acquisition from distributed, heterogeneous data, information integration in peer-to-peer networks.
- Carson Andorf, Ph.D. Student, Computer Science and Bioinformatics & Computational Biology. Focus: Data-driven discovery of protein sequence-structure-function relationships, data mining approaches to design of classifiers for assigning proteins to structural or functional classes.
- Changhui Yan, Ph.D. Student, Bioinformatics & Computational Biology and Computer Science. Focus: Data-driven discovery of sequence correlates of protein-protein interaction, learning from unbalanced data sets, design of feature representations and kernel functions for prediction of protein-protein interaction sites from amino acid sequence.
- Jie Bao, Ph.D. Student, Computer Science. Focus: Learning from structured features and data, relational learning from heterogeneous distributed data, cooperative multiagent learning.
- Facundo Bromberg, Ph.D. Student, Computer Science. Focus: Cooperative learning in multi-agent systems, cooperative knowledge acquisition from heterogeneous data in peer-to-peer information networks.
- Adrian Silvescu, Ph.D. Student, Computer Science. Focus: Discovering and learning from abstraction (attribute value groupings) and structuring (generation of structured features), learning from relational data.
- Feihong Wu, Ph.D. Student, Bioinformatics and Computational Biology. Focus: Ontology-based query-centric approaches to information integration for data-driven discovery of protein sequence-structure-function relationships.
- Oksana Yakhnenko, Ph.D. Student, Computer Science. Focus: Learning classifiers from topologically structured data.
Project Summary
The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. However, the massive size, heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to extracting knowledge from this data. This research seeks to overcome these hurdles through the design, analysis, and implementation of:
- Efficient distributed and cumulative learning algorithms with provable performance guarantees (relative to their centralized or batch counterparts) for knowledge acquisition from distributed data sources;
- Customizable information extraction agents that can effectively exploit domain or context-specific ontologies supplied by the users to extract the information needed for learning (e.g., statistics) from distributed data sources despite differences in query capabilities, interfaces, ontologies, and access restrictions, to facilitate analysis of heterogeneous distributed data from different perspectives; and
- INDUS - a test-bed for knowledge acquisition from heterogeneous distributed data in computational molecular biology (e.g., characterization of protein sequence-structure-function relationships using diverse sources of biological data).
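The idea of a distributed learner that matches its centralized counterpart can be illustrated with a small sketch. It is not the INDUS implementation: it assumes horizontally partitioned data, uses a Naive Bayes classifier as the example learner, and all function names and toy records are hypothetical. Each source returns only counts, and the merged counts are identical to those computed from the pooled data.

```python
from collections import Counter

def local_counts(rows):
    """Compute Naive Bayes sufficient statistics at one data source.

    rows is a list of (features_dict, label) pairs. Only counts are
    returned, so the raw records never leave the source.
    """
    class_counts = Counter()
    joint_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for attr, value in features.items():
            joint_counts[(attr, value, label)] += 1
    return class_counts, joint_counts

def merge_counts(stats):
    """Add up the counts returned by the individual sources."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in stats:
        total_class.update(class_counts)  # Counter.update adds counts
        total_joint.update(joint_counts)
    return total_class, total_joint

def predict(class_counts, joint_counts, features, n_values=2):
    """Naive Bayes prediction with Laplace smoothing, from counts alone.

    n_values is the assumed number of distinct values per attribute.
    """
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, c in sorted(class_counts.items()):
        score = c / n
        for attr, value in features.items():
            score *= (joint_counts[(attr, value, label)] + 1) / (c + n_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy records held at two autonomous sites.
site_a = [
    ({"hydrophobic": "yes", "charged": "no"}, "membrane"),
    ({"hydrophobic": "yes", "charged": "yes"}, "membrane"),
]
site_b = [
    ({"hydrophobic": "no", "charged": "yes"}, "soluble"),
    ({"hydrophobic": "no", "charged": "no"}, "soluble"),
    ({"hydrophobic": "yes", "charged": "no"}, "soluble"),
]

distributed = merge_counts([local_counts(site_a), local_counts(site_b)])
centralized = local_counts(site_a + site_b)
print(distributed == centralized)
print(predict(*distributed, {"hydrophobic": "no", "charged": "yes"}))
```

Because the classifier depends on the data only through these counts, the classifier built from the merged counts is the same one a centralized learner would produce, which is the kind of guarantee the first item above refers to.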
The resulting algorithms and software can accelerate, potentially by an order of magnitude, the rate of scientific discovery in emerging data-rich domains such as biological sciences. This research is closely integrated with education and training of graduate and undergraduate students in Computer Science and Bioinformatics and Computational Biology at Iowa State University.
Funding
At present, the primary source of funding for this project is:
This project has benefited from funding for related but non-overlapping work from other sources, including:
- Discovering Protein Sequence-Structure-Function Relationships, Biological Information Science and Technology Initiative, National Institutes of Health (2003-2007). Vasant Honavar (PI) (with Drena Dobbs and Robert Jernigan). $1,022,000.
- Pioneer Hi-Bred Graduate Fellowships in Bioinformatics and Computational Biology. Vasant Honavar (PI) (with doctoral students Adrian Silvescu and Carson Andorf). (2002-2004). $80,000.
- IBM Doctoral Research Fellowship. Vasant Honavar (with doctoral student Doina Caragea). (2003-2004). $25,000.
In the past, some of the work leading up to this project was supported in part by:
Publications
- Andorf, C., Dobbs, D., and Honavar, V. (2004). Discovering Protein Function Classification Rules from Reduced Alphabet Representations of Protein Sequences. Information Sciences. In press.
- Andorf, C., Silvescu, A., Dobbs, D., and Honavar, V. (2004). Learning Classifiers for Assigning Protein Sequences to Gene Ontology (GO) Functional Families. In: Proceedings of the Fifth International Conference on Knowledge Based Computer Systems (KBCS 2004). In press.
- Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004). Lecture Notes in Computer Science. Berlin: Springer-Verlag.
- Bao, J., and Honavar, V. (2004). Collaborative Ontology Building Using Wiki@ant. In: Proceedings of the Workshop on Evaluation of Ontology-Based Tools (EON2004). Lecture Notes in Computer Science. Berlin: Springer-Verlag. In press.
- Bao, J., and Honavar, V. (2004). Ontology Language Extensions to Support Localized Semantics, Modular Reasoning, and Collaborative Ontology Design and Reuse. To appear.
- Bao, J., Cao, Y., Tavanapong, W., and Honavar, V. (2004). Integration of Domain-Specific and Domain-Independent Ontologies for Colonoscopy Video Database Annotation. In: International Conference on Information and Knowledge Engineering (IKE 04). CSREA Press. pp. 82-88.
- Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1. pp. 80-89.
- Kang, D-K., Silvescu, A., Zhang, J., and Honavar, V. (2004). Generation of Attribute Value Taxonomies from Data for Data-Driven Construction of Accurate and Compact Classifiers. In: Proceedings of the IEEE International Conference on Data Mining. In press.
- Pathak, J., Caragea, D., and Honavar, V. (2004). Ontology-Extended Component-Based Workflows: A Framework for Constructing Complex Workflows from Semantically Heterogeneous Software Components. In: Proceedings of the Workshop on Semantic Web and Databases (SWDB-04). Lecture Notes in Computer Science. Berlin: Springer-Verlag. In press.
- Yan, C., Dobbs, D., and Honavar, V. (2004). A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Bioinformatics. Vol. 20. pp. i371-378.
- Yan, C., Honavar, V., and Dobbs, D. (2004). Identifying Protein-Protein Interaction Sites from Surface Residues - A Support Vector Machine Approach. Neural Computing Applications. Vol. 13. pp. 123-129.
- Zhang, J., and Honavar, V. (2004). AVT-NBL - An Algorithm for Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. In: Proceedings of the IEEE International Conference on Data Mining. In press.
- Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag.
- Caragea, D., Silvescu, A., and Honavar, V. (2003). Decision Tree Induction from Distributed, Heterogeneous, Autonomous Data Sources. In: Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA 03). Tulsa, Oklahoma.
- Caragea, D., Reinoso-Castillo, J., and Silvescu, A. (2003). Statistics Gathering for Information Integration on the Web. In: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web.
- Reinoso-Castillo, J., Silvescu, A., Caragea, D., Pathak, J., and Honavar, V. (2003). Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated, Query-Centric Approach. In: IEEE International Conference on Information Integration and Reuse.
- Wang, X., Schroeder, D., Dobbs, D., and Honavar, V. (2003). Automated Data-Driven Discovery of Motif-Based Protein Function Classifiers. Information Sciences. Vol. 155. pp. 1-18.
- Yan, C., Dobbs, D., and Honavar, V. (2003). Identification of Surface Residues Involved in Protein-Protein Interaction -- A Support Vector Machine Approach. In: Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA-03). Tulsa, Oklahoma.
- Zhang, J., and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC.
- Zhang, J., Silvescu, A., and Honavar, V. (2002). Ontology-Driven Induction of Decision Trees at Multiple Levels of Abstraction. In: Proceedings of Symposium on Abstraction, Reformulation, and Approximation. Berlin: Springer-Verlag.
Dissertations and Theses
- Caragea, D. (2004). Learning Classifiers from Semantically Heterogeneous, Distributed, Autonomous Data Sources. Ph.D. Dissertation. Department of Computer Science. Iowa State University.
- Reinoso-Castillo, J. (2002). Ontology-Driven Information Extraction and Integration from Autonomous, Heterogeneous, Distributed Data Sources -- A Federated Query-Centric Approach. Master's Thesis. Department of Computer Science. Iowa State University.
Project Impact
Contributions within Discipline: The project has contributed to the development of provably sound ontology-based approaches to data integration that allow scientists to view and combine a given set of data sources from multiple ontological points of view, based on ontologies of their own choosing. The framework supports efficient extraction of the sufficient statistics (e.g., counts that satisfy certain constraints on attribute values) needed to construct classifiers, under a broad range of assumptions about the capabilities offered by the information sources (execution of aggregate operators, execution of code supplied by the user). This work has also resulted in novel algorithms that exploit particularly common types of ontologies -- class-subclass hierarchies and attribute-value taxonomies -- in learning compact, accurate, and comprehensible classifiers from semantically heterogeneous distributed data. Collectively, these results represent important contributions towards the realization of the Semantic Web for e-Science.
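As a rough illustration of how counts can be gathered from a user's ontological point of view, the sketch below answers a count query against two sources that use different local vocabularies. It is a simplified sketch, not the INDUS query engine: the taxonomy, the mappings, the source names, and the records are all hypothetical, and each source is assumed to be able to evaluate a simple count over its own records.

```python
# Hypothetical attribute value taxonomy (child -> parent) chosen by the user.
USER_AVT = {
    "kinase": "enzyme",
    "protease": "enzyme",
    "enzyme": "protein",
    "transporter": "protein",
}

# Hypothetical mappings from each source's local terms to the user's terms.
MAPPINGS = {
    "source_1": {"KIN": "kinase", "PROT": "protease", "TRANSP": "transporter"},
    "source_2": {"phosphotransferase": "kinase", "peptidase": "protease"},
}

def ancestors(term, avt):
    """Return the term followed by all of its ancestors in the taxonomy."""
    chain = [term]
    while chain[-1] in avt:
        chain.append(avt[chain[-1]])
    return chain

def answer_count_query(source_name, local_values, user_term):
    """Count the records at one source whose value falls under user_term.

    Local values are first translated into the user's vocabulary; only
    the resulting count is returned, not the records themselves.
    """
    mapping = MAPPINGS[source_name]
    return sum(
        1
        for value in local_values
        if user_term in ancestors(mapping[value], USER_AVT)
    )

def count(user_term, sources):
    """Combine the per-source counts for a term of the user's ontology."""
    return sum(
        answer_count_query(name, values, user_term)
        for name, values in sources.items()
    )

# Hypothetical records, expressed in each source's own vocabulary.
sources = {
    "source_1": ["KIN", "KIN", "TRANSP", "PROT"],
    "source_2": ["peptidase", "phosphotransferase", "peptidase"],
}

for term in ("kinase", "enzyme", "protein"):
    print(term, count(term, sources))
```

Asking for counts at "enzyme" rather than "kinase" shows how the same sources can be viewed at different levels of abstraction in the attribute value taxonomy; a different user could supply a different taxonomy and mappings without touching the sources.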
Contributions to Other Disciplines: This research has resulted in applications of data mining to two representative problems in computational molecular biology -- sequence-based prediction of protein function and identification of protein-protein interaction sites.
Contributions to Education and Human Resources: The project has, with the help of funds leveraged from other sources, contributed to the training of several Ph.D. students in Computer Science (Doina Caragea, Adrian Silvescu, Jun Zhang, Jie Bao, Carson Andorf, Oksana Yakhnenko, Oksana Kohutyuk, Jyotishman Pathak, Dae-Ki Kang) and in Bioinformatics and Computational Biology (Changhui Yan and Mgavi Braithwaite), and of two M.S. students (Jaime Reinoso-Castillo and Anna Atramentov). The project has also provided research opportunities for two undergraduate students.
This research has led to the establishment of a collaborative research effort (including faculty and students from Computer Science, Statistics, and Psychology) in Computational Intelligence, Learning, and Discovery, focused on large-scale data-driven e-Science at Iowa State University.
This research has also strengthened interdisciplinary research collaborations between Vasant Honavar (a computer scientist), Drena Dobbs (a molecular biologist), Robert Jernigan (a biophysicist), and Heather Greenlee (a neuroscientist).
Integration of Research into Graduate and Undergraduate Curriculum: Honavar taught a module on machine learning approaches to bioinformatics, based in part on the results of research supported by this award, at an NSF-NIH supported Summer Institute on Bioinformatics and Computational Biology for undergraduates and beginning graduate students from around the US, held this summer at Iowa State University. Some of the research problems and results have also been integrated into a course on machine learning.
Current Research Directions
- Extending algorithms that are being developed by our group as well as others for learning from multiple relational databases to work with semantically heterogeneous data sources, taking advantage of the capability of INDUS to view heterogeneous information sources as though they were a collection of relational databases.
- Extending the ontology-based approach to information integration to develop ontology-based frameworks for composition of autonomously developed components and services, using emerging frameworks for data source (or, more generally, resource or service) description and registry services that are being developed as part of the Semantic Web efforts.
- Development of tools for collaborative ontology development, specification of semantic mappings between information sources, ontology merging, and learning specific types of ontologies (e.g., attribute value taxonomies) from data.
- Extension of approaches used in INDUS to support user and context-specific information integration and knowledge acquisition in peer-to-peer environments and distributed sensor networks.
- Further development of the INDUS prototype into a platform to support exploratory data analysis and knowledge acquisition in representative problems in bioinformatics and computational biology, e.g., data-driven construction of classifiers of protein function and predictors of protein-protein interaction.
- Dissemination of INDUS and associated software to the broader scientific community.
Software
Talks and Posters
To come
Dr. Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
Atanasoff Hall, Ames, IA 50011-1040 USA
phone: +1-515-294-1098, +1-515-294-4377; fax: +1-515-294-0258
© Vasant Honavar, 2004.