Project Summary
The ongoing transformation of biology from a
data-poor science into an increasingly data-rich science, with the attendant
increase in the number, size, and diversity of data sources (e.g.,
protein sequences, structures, expression patterns, interactions), offers
unprecedented and as yet largely unrealized opportunities for large-scale
collaborative discovery in areas such as the characterization of
macromolecular sequence-structure-function relationships and the discovery of
complex genetic regulatory networks.
Given the large number, autonomous nature, and size of the relevant data
sources, gathering all of the data in a centralized location is generally
neither desirable nor feasible. Hence, there is a need for methods to
perform the necessary analysis of data where the data and the computational
resources are available and transmit the results of analysis (knowledge
acquired from the data) to where they are needed. More importantly, data
sources developed by autonomous individuals or groups differ with respect to
their ontological commitments (that is, assumptions concerning the
objects that exist in the world, the properties or
attributes of the objects, the possible values of attributes, and
their intended meaning). Therefore, semantic differences among
autonomous data sources are simply unavoidable. Because data sources that
are created for use in one context often find use in other contexts or
applications and because users often need to analyze data in different
contexts from different perspectives, there is no single privileged ontology
that can serve all users, or for that matter, even a single user, in every
context. Effective use of multiple sources of data in a given context
requires flexible approaches to reconciling such semantic differences from
the user’s point of view.
To address the information integration and
knowledge acquisition needs of collaborative scientific discovery, we have
designed INDUS (INtelligent Data Understanding System), a federated,
query-centric system for knowledge acquisition from distributed,
semantically heterogeneous data (see Figure).
INDUS employs ontologies and inter-ontology mappings to enable a user or an
application to view a
collection of physically distributed, autonomous, semantically heterogeneous
data sources (regardless of location, internal structure and query
interfaces) as though they were a collection of tables structured according
to an ontology supplied by the user.
This allows INDUS to answer
user queries against distributed, semantically heterogeneous data sources
without the need for a centralized data warehouse or a common global
ontology.
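
As a rough illustration of the idea, the following Python sketch shows how
per-source mappings could let a user query two differently structured sources
as if both followed the user's own ontology. It is not the INDUS
implementation: the class MappedSource, the function query, and all attribute
names and values are hypothetical, and a real system would push the rewritten
query to each source rather than iterate over its rows.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


class MappedSource:
    """Hypothetical wrapper: one autonomous source plus its mapping to the user ontology."""

    def __init__(self, rows: List[Row], attr_map: Dict[str, str],
                 value_maps: Dict[str, Dict[object, object]]) -> None:
        self.rows = rows              # data as stored by the source, in its own terms
        self.attr_map = attr_map      # source attribute -> user attribute
        self.value_maps = value_maps  # user attribute -> {source value: user value}

    def view(self) -> Iterator[Row]:
        """Yield each row re-expressed in the user's attributes and values."""
        for row in self.rows:
            out: Row = {}
            for local_attr, value in row.items():
                user_attr = self.attr_map.get(local_attr)
                if user_attr is None:
                    continue  # attribute has no counterpart in the user ontology
                mapping = self.value_maps.get(user_attr, {})
                out[user_attr] = mapping.get(value, value)
            yield out


def query(sources: List[MappedSource], predicate: Callable[[Row], bool]) -> List[Row]:
    """Answer a query phrased in user terms without first merging the sources."""
    return [row for source in sources for row in source.view() if predicate(row)]


# Two sources that name the same concepts differently (made-up example data).
source_a = MappedSource(
    rows=[{"prot_id": "P1", "organism": "H. sapiens"},
          {"prot_id": "P2", "organism": "M. musculus"}],
    attr_map={"prot_id": "protein", "organism": "species"},
    value_maps={"species": {"H. sapiens": "Homo sapiens",
                            "M. musculus": "Mus musculus"}},
)
source_b = MappedSource(
    rows=[{"accession": "P3", "taxon": "human"}],
    attr_map={"accession": "protein", "taxon": "species"},
    value_maps={"species": {"human": "Homo sapiens", "mouse": "Mus musculus"}},
)

# The query uses only the user's vocabulary ("species", "Homo sapiens").
human_proteins = query([source_a, source_b],
                       lambda row: row["species"] == "Homo sapiens")
```

Because the mappings live with each source wrapper rather than in a shared
schema, a different user could supply different mappings over the same
sources without affecting this one.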

INDUS and the associated collection of software tools:
(a) Support, through a graphical user interface, the editing of ontologies
and the specification of semantic relationships between ontologies (using
inter-ontology mappings [Bao and Honavar, 2004]) by users with some
familiarity with the data sources.
(b) Enable users to query distributed, semantically heterogeneous data
and retrieve and manipulate results in a fashion that respects the
user-imposed semantic relationships between different sources of data [Caragea
et al., 2004b].
(c) Support construction of predictive classifiers from semantically
heterogeneous distributed data sources without having to assemble all of the
data at a central location [Caragea et al., 2004a; Caragea et al., 2004b].
This is achieved by decomposing the task of learning from data into
an information extraction task, which
formulates and sends a statistical query to a data source, and a hypothesis
generation task, which uses the resulting statistic to modify a partially
constructed hypothesis (and further invokes the information extraction
component as needed).
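
The decomposition described in (c) can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the INDUS code: Naive
Bayes is chosen here only because the statistics it needs are simple counts,
the data are assumed to be split by rows across sources that already share one
vocabulary, and the names DataSource, federated_count, learn_naive_bayes, and
predict are hypothetical.

```python
import math
from typing import Dict, List


class DataSource:
    """Hypothetical autonomous source that answers count queries locally."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = rows  # raw rows stay at the source

    def count(self, **conditions: str) -> int:
        """Information extraction: number of rows matching all conditions."""
        return sum(all(row.get(k) == v for k, v in conditions.items())
                   for row in self._rows)


def federated_count(sources: List[DataSource], **conditions: str) -> int:
    """Send the same statistical query to every source and add up the answers."""
    return sum(source.count(**conditions) for source in sources)


def learn_naive_bayes(sources, attributes, classes, label="label"):
    """Hypothesis generation: fill in the model using only returned counts."""
    model = {"prior": {}, "cond": {}}
    total = federated_count(sources)
    for c in classes:
        n_c = federated_count(sources, **{label: c})
        # Laplace smoothing keeps every probability strictly positive.
        model["prior"][c] = (n_c + 1) / (total + len(classes))
        for attr, values in attributes.items():
            for v in values:
                n_cv = federated_count(sources, **{label: c, attr: v})
                model["cond"][(attr, v, c)] = (n_cv + 1) / (n_c + len(values))
    return model


def predict(model, example, classes):
    """Return the class with the highest log posterior score."""
    def score(c):
        return math.log(model["prior"][c]) + sum(
            math.log(model["cond"][(attr, v, c)]) for attr, v in example.items())
    return max(classes, key=score)


# Made-up data held at two separate sites.
site1 = DataSource([{"hydrophobic": "yes", "label": "membrane"},
                    {"hydrophobic": "yes", "label": "membrane"},
                    {"hydrophobic": "no", "label": "soluble"}])
site2 = DataSource([{"hydrophobic": "no", "label": "soluble"},
                    {"hydrophobic": "yes", "label": "membrane"},
                    {"hydrophobic": "no", "label": "soluble"}])

classes = ["membrane", "soluble"]
attributes = {"hydrophobic": ["yes", "no"]}
model = learn_naive_bayes([site1, site2], attributes, classes)
prediction = predict(model, {"hydrophobic": "yes"}, classes)
```

Only counts cross the boundary between a source and the learner, so the
resulting model is the same one that would be obtained from the pooled rows,
yet no row is ever transmitted.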
Acknowledgements:
This work was funded in part by grants from the National
Science Foundation (IIS 0219699) and the National Institutes of Health (GM
066387).