Iowa State University

Iowa State UniversityIowa State University

College of Liberal Arts and Sciences

Department of Computer Science

Ph.D. Dissertation Defense - Ankit Agrawal


Date: 05 May, 2009
Time: 11:00 AM
Location: 223 Atanasoff Hall
Topic: Sequence-specific sequence comparison using pairwise statistical significance
Major Professor(s): Xiaoqiu Huang


Abstract:

Sequence comparison is one of the most fundamental computational problems in bioinformatics for which many approaches have been and are still being developed. In particular, pairwise sequence alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn forms the basis of many other applications in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix consisting of pairwise scores of aligning different residues with each other (like BLOSUM62), and give an alignment score for the given sequence-pair. The biologists routinely use such pairwise alignment programs to identify similar, or more specifically, related sequences (having common ancestor). It is widely accepted that the relatedness of two sequences is better judged by statistical significance of the alignment score rather than by the alignment score alone.

This research adresses the problem of accurately estimating statistical significance of pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research work are as follows. Firstly, using sequence-specific strategies for pairwise sequence alignment in conjuction with sequence-specific strategies for statistical significance estimation, wherein accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Secondly, using pairwise statistical significance to improve the performance of the most popular database search program PSI-BLAST. Thirdly, design and implementation of heuristics to speed-up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online.

With the all-pervasive application of sequence comparison in bioinformatics using the ever-increasing sequence data, this work is expected to offer useful contributions to the research community.