xINDUS: Semantic integration of Internet bioinformatics databases

Course project proposal of Computer Science 587
Principles of Network and Distributed Programming

as a part of project INDUS

 (Preliminary draft, subject to change) 

Jie Bao
2003-10-07

 

Background of the problem

With the rapid development of bioinformatics projects such as the Human Genome Project, a huge amount of data has been generated. Various bioinformatics databases have been set up and made public on the Internet during the past decade. For example, the PDB database provides protein structure information; Prosite provides protein sequence information and motifs; Swiss-Prot is a curated protein sequence database which strives to provide a high level of annotation; Gene Ontology (GO) produces a controlled vocabulary for proteins that can be applied to all organisms.

A bioinformatics query usually involves multiple bioinformatics databases. However, those databases are heterogeneous and distributed, which makes information integration a difficult task.

First, the databases use different ontologies, which results in poor interoperability. Users can understand and analyze the data only in terms of their own ontology, so semantic integration is required.

Second, the databases are autonomous and heterogeneous in their structure. Some information is hidden behind a query interface, which makes a transparent combined query impossible. Network communication and iterators are therefore needed.

This project is an attempt to provide ontology-based semantic integration of a chosen set of online bioinformatics databases.

 

Proposed approach

The project will focus on the following problems, which relate to the content of this course:

1.      A set of XML-based iterators provides a common interface to the corresponding protein databases.

2.      The iterators communicate directly with the remote databases.

3.      A combined query can be decomposed into sub-queries that are executed against the source databases.

4.      Results from the source databases are composed on the client side.
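The four points above can be sketched in Java as follows. This is only a minimal illustration, not the INDUS design: the names SourceIterator, InMemorySource and CombinedQuery are hypothetical, the "decomposition" simply sends the same key to every source, and an in-memory map stands in for the network call that a real iterator would make to a remote database.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical common interface: every source database is wrapped by an
// iterator that accepts a sub-query and yields its records as XML strings.
interface SourceIterator {
    String sourceName();
    Iterator<String> query(String proteinId);
}

// Stand-in for a remote source. A real wrapper would contact the remote
// database here instead of reading from an in-memory map (assumption).
class InMemorySource implements SourceIterator {
    private final String name;
    private final Map<String, String> records;

    InMemorySource(String name, Map<String, String> records) {
        this.name = name;
        this.records = records;
    }

    public String sourceName() {
        return name;
    }

    public Iterator<String> query(String proteinId) {
        List<String> hits = new ArrayList<>();
        String value = records.get(proteinId);
        if (value != null) {
            // Values are assumed to contain no XML special characters.
            hits.add("<record source=\"" + name + "\">" + value + "</record>");
        }
        return hits.iterator();
    }
}

class CombinedQuery {
    // Decompose: forward the key to each source iterator in turn.
    // Compose: concatenate the returned records into one client-side document.
    static String run(String proteinId, List<SourceIterator> sources) {
        StringBuilder out = new StringBuilder("<result id=\"" + proteinId + "\">");
        for (SourceIterator source : sources) {
            Iterator<String> it = source.query(proteinId);
            while (it.hasNext()) {
                out.append(it.next());
            }
        }
        return out.append("</result>").toString();
    }
}
```

A caller would build one InMemorySource (or a real network-backed implementation) per database, put them in a list, and call CombinedQuery.run with the identifier of interest. Keeping the per-source logic behind one interface means a new database can be added without touching the composition step.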

The proposed system does not aim to process large volumes of data or to gather statistics about the remote databases.

 

Implementation plan

  1.  The project will be implemented on the Java platform.
  2.  XML processing will use JAXP or Xerces.
  3.  Queries will use XQuery or another XML-based query technique.
  4.  An ontology language will be used. The work builds on the open source projects OilEd or Protégé.
  5.  The following databases will be integrated: PDB, Prosite, Swiss-Prot, ENZYME.
  6.  A GO interface will be provided (which is XML-based).
  7.  A query GUI will be provided.
  8.  What has been done so far:

  1.  An XML editor based on Java Swing
  2.  Prosite iterators in the INDUS system
  3.  DAML+OIL OWL implementation in OilEd
  4.  OWL implementation in
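As an illustration of the XML processing step in the plan, the following sketch uses the standard JAXP DOM API to read one field from a record returned by a source iterator. The RecordParser class and the entry/accession/name record layout are assumptions made for this example; the actual formats returned by PDB, Prosite, Swiss-Prot or GO are not specified in this proposal.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts one field from an XML record using JAXP.
class RecordParser {
    // Returns the text of the first element named tagName, or null if the
    // record contains no such element.
    static String firstElementText(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed input or parser configuration problems are reported
            // as an unchecked exception to keep the sketch short.
            throw new IllegalArgumentException("Could not parse record", e);
        }
    }
}
```

For example, given the made-up record "<entry><accession>X1</accession><name>demo</name></entry>", calling RecordParser.firstElementText(record, "name") yields the text of the name element. Because JAXP is only an API layer, the same code should work whether the underlying parser is the JDK default or Xerces.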

 

Work schedule

Week                          Focus                                                                   Status

Week 1   Oct 5 - 11           XML editing and parsing, with a Swing-based interface                   Done
Week 2   Oct 12 - 18          PDB interface                                                           In progress
Week 3   Oct 19 - 25          Swiss-Prot interface
Week 4   Oct 26 - Nov 1       Enzyme interface
Week 5   Nov 2 - 8            GO interface
Week 6   Nov 9 - 15           Query decomposition based on XQuery (not a complete implementation)
Week 7   Nov 16 - 22          DAML+OIL description
Week 8   Nov 23 - 29          Query
Week 9   Nov 30 - Dec 6       GUI integration and testing
Week 10  Dec 7 - Dec 10       Documentation

Reference: see Developing resources to INDUS project


