PhD Preliminary Oral Exam: Priyanka Banerjee
Computational Methods for Biological Sequence Analysis: From Functional Discovery in Metagenomes to Peptide Engineering
Biological sequence data are inherently complex due to high dimensionality, incompleteness, and limited functional understanding. Despite advances in high-throughput sequencing, extracting meaningful biological insights remains challenging. Computational and AI models have progressed rapidly, yet their application to biological sequences is hindered by the vast unknowns in biology. We investigate two types of sequence data to develop improved strategies for functional inference and design.
In our first study, we analyze metagenomic data, which capture environmental DNA as fragmented sequences prone to assembly errors and missing context. To address these challenges, we developed POEM (Pipeline for Operon Exploration in Metagenomes), a computational framework that integrates machine learning with network-based modeling to reveal functional organization. Central to this approach are core operons (evolutionarily conserved, co-transcribed gene groups), which are reconstructed through a metagenomic functional network and provide insight into microbial community function despite incomplete data.
In our second study, we focus on peptides, small proteins with antimicrobial potential. Understanding their structural and functional determinants is critical for designing novel therapeutics. We introduce PLUM (Peptide modeLs for Understanding and engineering antiMicrobial therapeutics), a generative modeling framework that learns disentangled latent representations separating sequence composition, functional activity, and peptide length. This enables precise control over peptide properties, supporting both de novo design and prototype-conditioned generation, and allowing creation of diverse, biologically relevant antimicrobial peptides.
Together, these studies provide computational frameworks that account for the complexity of biological sequences: they enhance functional understanding of fragmented metagenomes and enable controlled design of peptides with desired properties, laying the groundwork for data-driven discovery in computational biology.
Committee: Oliver Eulenstein (major professor), Iddo Friedberg (major professor), Qi Li, Xiaoqiu Huang and Britta Rued