Algorithms and Software for Knowledge Acquisition from Heterogeneous, Distributed, Autonomous Information Sources
The Intelligent Data Understanding System (INDUS) Project
Personnel
- Dr. Vasant Honavar, Professor of Computer Science and of Bioinformatics and Computational Biology, Principal Investigator.
- Dr. Drena Dobbs, Associate Professor of Molecular, Cell, and Developmental Biology, Co-Principal Investigator.
- Doina Caragea, Ph.D. Student, Computer Science (graduated in 2004, Current position: Assistant professor of Computer Science, Kansas State University). Focus: Algorithms for learning classifiers from heterogeneous data, Efficient extraction of sufficient statistics from heterogeneous data, theoretical framework for knowledge acquisition from heterogeneous, distributed, autonomous data.
- Jun Zhang, Ph.D. Student, Computer Science (graduated in 2005, Current Position: Research Scientist, Data Mining, Fair Isaac). Focus: Learning classifiers from attribute value taxonomies and class taxonomies and partially specified data.
- Jyotishman Pathak, Ph.D. Student, Computer Science. Focus: Information integration for knowledge acquisition from distributed, heterogeneous data; service-oriented computing.
- Carson Andorf, Ph.D. Student, Computer Science and Bioinformatics & Computational Biology. Focus: Data-driven discovery of protein sequence-structure-function relationships, data mining approaches to design of classifiers for assigning proteins to structural or functional classes.
- Jie Bao, Ph.D. Student, Computer Science. Focus: Ontology language extensions and collaborative environments for data source description and ontology construction.
- Neeraj Koul, Ph.D. Student, Computer Science. Focus: Query Planning, Query Decomposition, and Query Optimization for querying semantically disparate data sources from a user's point of view.
- Adrian Silvescu, Ph.D. Student, Computer Science. Focus: Discovering and learning from abstraction (attribute value groupings) and structuring (generation of structured features), learning from relational data.
- Feihong Wu, Ph.D. Student, Bioinformatics and Computational Biology. Focus: Ontology-based query-centric approaches to information integration for data-driven discovery of protein sequence-structure-function relationships.
- Oksana Yakhnenko, Ph.D. Student, Computer Science. Focus: Learning classifiers from topologically structured data.
Project Summary
The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. However, the massive size, heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to extracting knowledge from these data. This research seeks to overcome these hurdles through the design, analysis, and implementation of:
- Efficient distributed and cumulative learning algorithms with provable performance guarantees (relative to their centralized or batch counterparts) for knowledge acquisition from distributed data sources;
- Customizable information extraction agents that can effectively exploit domain or context-specific ontologies supplied by the users to extract the information needed for learning (e.g., statistics) from distributed data sources despite differences in query capabilities, interfaces, ontologies, and access restrictions to facilitate analysis of heterogeneous distributed data from different perspectives.
- INDUS - a test-bed for knowledge acquisition from heterogeneous distributed data in computational molecular biology (e.g., characterization of protein sequence-structure-function relationships using diverse sources of biological data).
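The idea of learning from distributed data with guarantees relative to a centralized learner can be illustrated with a minimal sketch. It assumes a Naive Bayes style learner whose sufficient statistics are simple counts; the function names, attribute names, and toy records below are hypothetical and are not taken from the INDUS code. Each data source returns only counts, the counts are summed, and the result is identical to what would be obtained from the pooled data.

```python
from collections import Counter

def local_counts(records, class_attr):
    """Runs at a single data source; returns counts only, never raw records."""
    class_counts = Counter()
    joint_counts = Counter()
    for rec in records:
        label = rec[class_attr]
        class_counts[label] += 1
        for attr, value in rec.items():
            if attr != class_attr:
                joint_counts[(attr, value, label)] += 1
    return class_counts, joint_counts

def merge(stats):
    """Combines per-source counts by addition (counts are additive)."""
    class_total, joint_total = Counter(), Counter()
    for class_counts, joint_counts in stats:
        class_total.update(class_counts)
        joint_total.update(joint_counts)
    return class_total, joint_total

def conditional_prob(class_counts, joint_counts, attr, value, label, n_values):
    """Laplace-smoothed estimate of P(attr = value | label)."""
    return (joint_counts[(attr, value, label)] + 1) / (class_counts[label] + n_values)

# Toy records held at two hypothetical sites.
site_a = [
    {"length": "short", "charge": "pos", "function": "kinase"},
    {"length": "long", "charge": "neg", "function": "ligase"},
]
site_b = [
    {"length": "short", "charge": "pos", "function": "kinase"},
    {"length": "short", "charge": "neg", "function": "ligase"},
]

distributed = merge([local_counts(site_a, "function"),
                     local_counts(site_b, "function")])
centralized = local_counts(site_a + site_b, "function")
```

In this sketch `distributed` and `centralized` hold the same counts, so any classifier built from them is the same whether or not the data were ever brought together. Learners whose statistics are not simply additive would need a different aggregation step.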
Funding
The primary sources of funding for this project are:
- ITR: Algorithms and Software for Knowledge Acquisition from Heterogeneous Distributed Data. National Science Foundation. Vasant Honavar (PI), Drena Dobbs (Co-PI). (2002-2005). $210,000.
- Discovering Protein Sequence-Structure-Function Relationships, Biological Information Science and Technology Initiative, National Institutes of Health (2003-2007). Vasant Honavar (PI), (with Drena Dobbs and Robert Jernigan), $1,022,000.
- Pioneer Hi-Bred Graduate Fellowships in Bioinformatics and Computational Biology. Vasant Honavar (PI) (with doctoral students Adrian Silvescu and Carson Andorf). (2002-2004). $80,000.
- IBM Doctoral Research Fellowship. Vasant Honavar (with doctoral student Doina Caragea). (2003-2004). $25,000.
- Distributed Knowledge Networks, John Deere Foundation, 1999-2000. Vasant Honavar. $30,000.
- An Agent-Based Environment for Integrating and Analysing Plant Genome Databases. Pioneer Hi-Bred International, Inc. 2000-2001. Vasant Honavar, Drena Dobbs, and Les Miller. $40,000.
Publications
- Bao, J., Hu, Z., Caragea, D., Reecy, J., and Honavar, V. A Tool for Collaborative Construction of Large Biological Ontologies. Fourth International Workshop on Biological Data Management (BIDM 2006), Krakow, Poland, IEEE Press. In press, 2006.
- Bao, J., Caragea, D., and Honavar, V. Towards Collaborative Environments for Ontology Construction and Sharing. Proceedings of the International Symposium on Collaborative Technologies and Systems, Las Vegas, 2006.
- Bao, J., Caragea, D., and Honavar, V. A Distributed Tableau Algorithm for Package-based Description Logics. Proceedings of the Second International Workshop on Context Representation and Reasoning (CRR 2006), Riva del Garda, Italy, CEUR. In press, 2006.
- Bao, J., Caragea, D., and Honavar, V. Modular Ontologies - A Formal Investigation of Semantics and Expressivity. In Proceedings of the First Asian Semantic Web Conference, Beijing, China, Springer-Verlag. In press, 2006.
- Kang, D-K., Silvescu, A. and Honavar, V. RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science, Berlin: Springer-Verlag. pp. 45-54, Accepted, 2006.
- Pathak, J., Basu, S., and Honavar, V. Modeling Web Service Composition Using Symbolic Transition Systems. AAAI '06 Workshop on AI-Driven Technologies for Services-Oriented Computing (AI-SOC), Boston, MA, AAAI Press, Accepted, 2006.
- Pathak, J., Basu, S., Lutz, R., and Honavar, V. MoSCoE: A Framework for Modeling Web Service Composition and Execution. IEEE Conference on Data Engineering Ph.D. Workshop, Atlanta, GA, 2006.
- Pathak, J., Yong, J., Honavar, V., and McCalley, J. Condition Data Aggregation for Failure Mode Estimation of Power Transformers. Hawaii International Conference on Systems Sciences, IEEE Computer Society. pp. 241a, 2006.
- Vasile, F., Silvescu, A., Kang, D-K., and Honavar, V. TRIPPER: An Attribute Value Taxonomy Guided Rule Learner. Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Berlin: Springer-Verlag. pp. 55-59, 2006.
- Zhang, J., Kang, D-K., Silvescu, A. and Honavar, V. Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. Knowledge and Information Systems. Vol. 9. No. 2. pp. 157-179, 2006.
- Caragea, D., Zhang, J., Bao, J., Pathak, J., and Honavar, V. (2005). Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous Information Sources (Invited paper). In: Proceedings of the 16th International Conference on Algorithmic Learning Theory. Lecture Notes in Computer Science. Singapore. Vol. 3734. pp. 13-44. Berlin: Springer-Verlag.
- Caragea, D., Silvescu, A., Pathak, J., Bao, J., Andorf, C., Dobbs, D., and Honavar, V. (2005). Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. In: Data Integration in Life Sciences (DILS 2005). Springer-Verlag Lecture Notes in Computer Science. San Diego. Vol. 3615. pp. 175-190. Berlin: Springer-Verlag.
- Caragea, D., Bao, J., Pathak, J., Andorf, C., Dobbs, D., and Honavar, V. Information Integration from Semantically Heterogeneous Biological Data Sources. Proceedings of the Sixteenth International Workshop on Databases and Expert Systems Applications (DEXA 05), Copenhagen, IEEE Computer Society. pp. 580-584, 2005.
- Kang, D-K., Zhang, J., Silvescu, A., and Honavar, V. Multinomial Event Model Based Abstraction for Sequence and Text Classification. Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA 2005), Edinburgh, UK, Berlin: Springer-Verlag. Vol. 3607. pp. 134-148, 2005.
- Kang, D-K., Fuller, D., and Honavar, V. Learning Misuse and Anomaly Detectors from System Call Frequency Vector Representation. IEEE International Conference on Intelligence and Security Informatics. Springer-Verlag Lecture Notes in Computer Science, Springer-Verlag. Vol. 3495. pp. 511-516, 2005.
- Kang, D-K., Fuller, D., and Honavar, V. Learning Classifiers for Misuse and Anomaly Detection Using a Bag of System Calls Representation. Proceedings of the 6th IEEE Systems, Man, and Cybernetics Workshop (IAW 05), West Point, NY, IEEE. pp. 118-125, 2005.
- Pathak, J., Koul, N., Caragea, D., and Honavar, V. (2005). A Framework for Semantic Web Services Discovery. In: Proceedings of the 7th ACM International Workshop on Web Information and Data Management (WIDM 2005). pp. 45-50. ACM Press.
- Yakhnenko, O., Silvescu, A., and Honavar, V. Discriminatively Trained Markov Model for Sequence Classification. IEEE Conference on Data Mining (ICDM 2005), Houston, Texas, IEEE Press, 2005.
- Zhang, J., Caragea, D. and Honavar, V. (2005). Learning Ontology-Aware Classifiers. In: Proceedings of the 8th International Conference on Discovery Science. Springer-Verlag Lecture Notes in Computer Science. Singapore. Vol. 3735. pp. 308-321. Berlin: Springer-Verlag.
- Bao, J., Cao, Y., Tavanapong, W., and Honavar, V. (2004). Integration of Domain-Specific and Domain-Independent Ontologies for Colonoscopy Video Database Annotation. In: International Conference on Information and Knowledge Engineering (IKE 04). CSREA Press. pp. 82-88.
- Bao, J. and Honavar, V. (2004). Collaborative Ontology Building With Wiki@nt. In: Third International Workshop on Evaluation of Ontology Building Tools. Hiroshima.
- Caragea, D., Pathak, J. and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004). Springer-Verlag Lecture Notes in Computer Science. Agia Napa, Cyprus. Vol. 3291. pp. 963-980. Springer-Verlag.
- Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1. pp. 80-89.
- Kang, D-K., Silvescu, A., Zhang, J., and Honavar, V. (2004). Generation of Attribute Value Taxonomies from Data for Data-Driven Construction of Accurate and Compact Classifiers. In: Proceedings of the IEEE International Conference on Data Mining.
- Pathak, J., Caragea, D., and Honavar, V. (2004). Ontology-Extended Component-Based Workflows: A Framework for Constructing Complex Workflows from Semantically Heterogeneous Software Components. In: Proceedings of the Workshop on Semantic Web and Databases (SWDB-04). Springer-Verlag Lecture Notes in Computer Science. In press.
- Yan, C., Dobbs, D., and Honavar, V. (2004). A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. In: Bioinformatics. Vol. 20. pp. i371-378.
- Yan, C., Honavar, V. and Dobbs, D. (2004). Identifying Protein-Protein Interaction Sites from Surface Residues - A Support Vector Machine Approach. Neural Computing and Applications. Vol. 13. pp. 123-129.
- Zhang, J. and Honavar, V. (2004). AVT-NBL - An Algorithm for Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. In: Proceedings of the IEEE International Conference on Data Mining.
- Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag.
- Caragea, D., Reinoso-Castillo, J., and Silvescu, A. (2003). Statistics Gathering for Information Integration on the Web. In: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web.
- Reinoso-Castillo, J., Silvescu, A., Caragea, D., Pathak, J. and Honavar, V. (2003). Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated, Query-Centric Approach. IEEE International Conference on Information Integration and Reuse.
- Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. In press.
- Reinoso-Castillo, J. (2002). Ontology-Driven Information Extraction and Integration from Autonomous, Heterogeneous, Distributed Data Sources -- A Federated Query-Centric Approach. Masters Thesis. Artificial Intelligence Research Laboratory. Department of Computer Science. Iowa State University.
- Zhang, J., Silvescu, A., and Honavar, V. (2002). Ontology-Driven Induction of Decision Trees at Multiple Levels of Abstraction. In: Proceedings of Symposium on Abstraction, Reformulation, and Approximation. Berlin: Springer-Verlag.
- Caragea, D., Silvescu, A., and Honavar, V. (2001). Towards a Theoretical Framework for Analysis and Synthesis of Agents That Learn from Distributed Dynamic Data Sources (Invited chapter). In: Emerging Neural Architectures Based on Neuroscience. Berlin: Springer-Verlag.
- Polikar, R., Udpa, L., Udpa, S., and Honavar, V. (2001). Learn++: An Incremental Learning Algorithm for Multi-Layer Perceptron Networks. IEEE Transactions on Systems, Man, and Cybernetics. Vol. 31, No. 4. pp. 497-508.
- Caragea, D., Silvescu, A., and Honavar, V. (2000). Agents That Learn from Distributed Dynamic Data Sources. In: Proceedings of the ECML 2000/Agents 2000 Workshop on Learning Agents. Barcelona, Spain.
- Honavar, V., Miller, L. and Wong, J. (1998). Distributed Knowledge Networks. In: Proceedings of the IEEE Information Technology Conference. Syracuse, NY.
Project Impact
Contributions within Discipline
The project has contributed to the development of provably sound ontology-based approaches to data integration that allow scientists to view and combine a given set of data sources from multiple ontological points of view, based on ontologies of their own choosing. The framework supports efficient extraction of the sufficient statistics (e.g., counts that satisfy certain constraints on attribute values) needed for construction of classifiers under a broad range of assumptions concerning the capabilities offered by the information sources (execution of aggregate operators, execution of code supplied by the user). This work has also resulted in novel algorithms for exploiting particularly common types of ontologies -- class-subclass hierarchies and attribute-value taxonomies -- in learning compact, accurate, and comprehensible classifiers from semantically heterogeneous distributed data. These results collectively represent important contributions towards the realization of the Semantic Web for e-Science.
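As a rough illustration of counting "from a user's point of view", the following sketch assumes a small user-supplied attribute value taxonomy and per-source mappings from each source's vocabulary into that taxonomy. All terms, source names, and function names are hypothetical and chosen only for the example; this is not the INDUS query interface. Counts are gathered at whichever level of abstraction the user selects.

```python
from collections import Counter

# Hypothetical attribute value taxonomy in the user's ontology (child -> parent).
USER_AVT = {
    "serine kinase": "kinase",
    "tyrosine kinase": "kinase",
    "kinase": "enzyme",
    "ligase": "enzyme",
}

# Hypothetical mappings from each source's own vocabulary to user-ontology terms.
MAPPINGS = {
    "source_1": {"Ser/Thr PK": "serine kinase", "Tyr PK": "tyrosine kinase", "Lig": "ligase"},
    "source_2": {"kinase_S": "serine kinase", "ligase": "ligase"},
}

def ancestors_or_self(term):
    """Returns the term followed by its ancestors, from most to least specific."""
    chain = [term]
    while term in USER_AVT:
        term = USER_AVT[term]
        chain.append(term)
    return chain

def count_at_level(source_values, level_terms):
    """Counts values after translating them into the user's ontology and
    lifting each one to the first ancestor that belongs to level_terms."""
    counts = Counter()
    for source, values in source_values.items():
        for raw in values:
            term = MAPPINGS[source][raw]
            for candidate in ancestors_or_self(term):
                if candidate in level_terms:
                    counts[candidate] += 1
                    break
    return counts

# Toy attribute values as each source would report them.
data = {
    "source_1": ["Ser/Thr PK", "Tyr PK", "Lig"],
    "source_2": ["kinase_S", "ligase", "ligase"],
}

fine = count_at_level(data, {"serine kinase", "tyrosine kinase", "ligase"})
coarse = count_at_level(data, {"kinase", "ligase"})
```

Here `fine` distinguishes the two kinase subtypes while `coarse` merges them under "kinase". A different user could supply a different taxonomy and mappings and obtain counts suited to their own view of the same sources.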
Contributions to Other Disciplines
This research has resulted in applications of data mining to two representative problems in computational molecular biology -- sequence-based prediction of protein function and identification of protein-protein interaction sites.
Contributions to Education and Human Resources
The project, with the help of funds leveraged from other sources, has contributed to the training of several Ph.D. students in Computer Science (Doina Caragea, Adrian Silvescu, Jun Zhang, Jie Bao, Carson Andorf, Oksana Yakhnenko, Jyotishman Pathak, Dae-Ki Kang) and Bioinformatics and Computational Biology (Changhui Yan and Mgavi Braithwaite) and two M.S. students (Jaime Reinoso-Castillo and Anna Atramentov). The project has also provided research opportunities for two undergraduate students. This research has led to the establishment of a Center for Computational Intelligence, Learning, and Discovery focused on large-scale data-driven e-Science at Iowa State University. This research has also strengthened interdisciplinary research collaborations between Vasant Honavar (a computer scientist), Drena Dobbs (a molecular biologist), Robert Jernigan (a biophysicist) and Heather Greenlee (a neuroscientist).
Integration of Research into Graduate and Undergraduate Curriculum
Honavar has developed and taught a module on machine learning approaches to bioinformatics, based in part on the results of research supported by this award, at an NSF-NIH supported Summer Institute on Bioinformatics and Computational Biology held this summer at Iowa State University for undergraduates and beginning graduate students from around the US. Some of the research problems and results have also been integrated into a course on machine learning.
Current Research Directions
- Extending algorithms that are being developed by our group as well as others for learning from multiple relational databases to work with semantically heterogeneous data sources, taking advantage of the capability of INDUS to view heterogeneous information sources as though they were a collection of relational databases.
- Extending the ontology-based approach to information integration to develop ontology-based frameworks for composition of autonomously developed components and services, using emerging frameworks for data source (or, more generally, resource or service) description and registry services that are being developed as part of the Semantic Web efforts.
- Development of tools for collaborative ontology development, specification of semantic mappings between information sources, ontology merging, and learning specific types of ontologies (e.g., attribute value taxonomies) from data.
- Extension of approaches used in INDUS to support user and context-specific information integration and knowledge acquisition in peer-to-peer environments and distributed sensor networks.
- Further development of the INDUS prototype into a platform to support exploratory data analysis and knowledge acquisition in representative problems in bioinformatics and computational biology, e.g., data-driven construction of classifiers of protein function and predictors of protein-protein interaction.
- Dissemination of INDUS and associated software to the broader scientific community.
Software
- INDUS -- a prototype system for flexible information extraction and integration using user-supplied ontologies from heterogeneous, distributed, autonomous information sources. We plan to make the software, documentation, and illustrative examples of its use freely available to the scientific community over the Internet under an open source license.
- INDUS-DM -- an open source suite of learning algorithms that decouple the data-source-dependent and data-source-independent components of learning using sufficient statistics.
- Wiki@nt -- open source code for collaborative modular ontology development and reuse.
- AVT-DTL -- software for learning decision tree classifiers from attribute value taxonomies and data. The software, some sample data sets, and attribute value taxonomies are available for download.
- AVT-NBL -- software for learning naive Bayes classifiers from attribute value taxonomies and data. The software, some sample data sets, and attribute value taxonomies are available for download.
Talks
- Honavar, V. and Caragea, D. Querying Semantically Heterogeneous Data Sources from a User's Point of View. 2006 Semantic Technology Conference. San Jose, CA, March 6-9, 2006.
- Vasant Honavar, Invited Talk, Algorithms and Software for Collaborative Knowledge Acquisition from Autonomous, Distributed, Semantically Heterogeneous Information Sources. 16th International Conference on Algorithmic Learning Theory (ALT 05) and 8th International Conference on Discovery Science, Singapore, October 8-11, 2005.
- Bao, J., Pathak, J. and Honavar, V. (2005). Ontology-based Information Integration using INDUS System. In: The Program of the Eighth Annual Bio-Ontologies Meeting (Bio-Ont SIG 2005). Poster Session, Intelligent Systems in Molecular Biology (ISMB 2005). Detroit, Michigan.
- Caragea, D., Silvescu, A., Pathak, J., Bao, J., Andorf, C., Yan, C., Dobbs, D. and Honavar, V. (2005). Knowledge Acquisition from Autonomous, Distributed, Semantically Heterogeneous Data Sources. In: Poster Program, Intelligent Systems in Molecular Biology (ISMB 2005), Detroit, Michigan.
- Pathak, J., Bao, J., Caragea, D., Silvescu, A., Andorf, C., Yan, C., Dobbs, D. and Honavar, V. (2005). INDUS: A System for Information Integration and Knowledge Acquisition from Autonomous, Distributed, and Semantically Heterogeneous Data Sources. In: Software Demo Program, Intelligent Systems in Molecular Biology (ISMB 2005), Detroit, Michigan.