PhD Research Proficiency Exam: Isaak Daniels
Extracting the Information in the Proteome with Continuous Wavelet Transforms
Goal: Our long-term goal is to characterize the relationships among protein sequences more effectively, in order to derive protein functions more accurately.
Background: We begin our research by examining the classes of proteases. Proteases are defined by their function, which is to cleave proteins. We already have a pipeline that identified twenty-one novel proteases (fourteen of which were carboxypeptidases, a specific type of protease that cleaves at the carboxy-terminal end of a protein). The pipeline combined PROST, a protein homology tool developed in our group, with three other homology tools (BLAST, Foldseek, MMseq), followed by alignment of the active site with a specific reference protease to validate the relationship. PROST uses the Discrete Cosine Transform (DCT) to extract the top global modes, which are then compared using the L1 distance. My contribution is combing the literature to learn whether the method discriminates accurately between proteins that have often been confounded, such as lipases, hydrolases, epoxide hydrolases, and carboxypeptidases. This led to reassessing the previous assignments and devising new ways to discriminate among them.
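The DCT-plus-L1 idea can be illustrated with a minimal sketch. This is not the PROST implementation: it assumes a simple numeric encoding of residues (the Kyte-Doolittle hydropathy scale, chosen here only for illustration), keeps the first k low-frequency DCT-II coefficients as the "global modes", and divides by sequence length so that proteins of different lengths can be compared. The function names `encode`, `dct2`, `top_modes`, and `l1_distance` are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed encoding, for illustration only).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map a protein sequence to a list of hydropathy values."""
    return [KD[residue] for residue in seq]


def dct2(x):
    """Unnormalized DCT-II of a list of floats (direct O(n^2) evaluation)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def top_modes(seq, k=5):
    """First k low-frequency DCT coefficients, scaled by sequence length."""
    x = encode(seq)
    n = len(x)
    return [c / n for c in dct2(x)[:k]]


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # Two made-up sequences of different lengths.
    d = l1_distance(top_modes("MKTAYIAKQR"), top_modes("MKTAYLAKQRGG"))
    print(f"L1 distance between low-frequency modes: {d:.4f}")
```

Because only the lowest-frequency coefficients are kept, the comparison is sensitive to the overall shape of the profile rather than to local detail.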
Approach: Our new approach aims to understand how the tools of information theory can support discrimination of protein function. The method uses the Continuous Wavelet Transform (CWT) and Dynamic Time Warping (DTW) to compare the features of the respective proteins that have the largest Shannon information content. CWT, in contrast to DCT, can detect modes at various length scales and serves as a “sliding window” that helps determine more accurately whether a feature of a certain length is important. We will review previous literature on the various wavelet transform methods used to determine motif patterns in protein sequences from physicochemical properties. Future work will benchmark the approach to determine whether it is not only theoretically defensible but can also outperform other methods that rely on protein structures. These validations will cover entire domains as well as short linear motifs, the fitness of mutations, and predictions of protein function.
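A rough sketch of how CWT and DTW could fit together is given below. It is an illustration under stated assumptions, not the planned implementation: the wavelet is assumed to be the Ricker (Mexican hat) wavelet, the input is assumed to be a numeric per-residue profile, the DTW cost is the absolute difference, and the selection of features by Shannon information content is not shown, so coefficient rows are simply compared scale by scale. The names `ricker`, `cwt`, `dtw`, and `multiscale_dtw` are hypothetical.

```python
import math


def ricker(t, a):
    """Ricker (Mexican hat) wavelet of width a, evaluated at offset t."""
    u = t / a
    norm = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    return norm * (1.0 - u * u) * math.exp(-u * u / 2.0)


def cwt(signal, scales):
    """One row of wavelet coefficients per scale (direct sum, zero padding)."""
    n = len(signal)
    rows = []
    for a in scales:
        half = int(math.ceil(5 * a))  # truncate the wavelet at +/- 5 widths
        row = []
        for i in range(n):
            lo, hi = max(0, i - half), min(n, i + half + 1)
            row.append(sum(signal[j] * ricker(j - i, a) for j in range(lo, hi)))
        rows.append(row)
    return rows


def dtw(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def multiscale_dtw(x, y, scales):
    """Sum of DTW distances between the CWT rows of x and y at each scale."""
    return sum(dtw(rx, ry) for rx, ry in zip(cwt(x, scales), cwt(y, scales)))


if __name__ == "__main__":
    # Two made-up profiles; the second repeats one position of the first.
    x = [0.0, 1.0, 3.0, 1.0, 0.0, -2.0, 0.0]
    y = [0.0, 1.0, 3.0, 3.0, 1.0, 0.0, -2.0, 0.0]
    print(multiscale_dtw(x, y, scales=[1, 2, 4]))
```

Small scales respond to short features and large scales to long ones, which is what lets the transform act as a sliding window of adjustable length; DTW then tolerates insertions and deletions that shift those features along the sequence.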
Preliminary Results: Implementing the code needed to calculate benchmarks accurately will require significant time. The background literature indicates that the various aspects of our pipeline have significant promise.
Committee: Robert Jernigan (major professor), Jack Lutz (major professor), Xiaoqiu Huang, Ali Jannesari, Olga Zabotina and Karin Dorman