Title: Boag: Shared Data Science Infrastructure for Genomics Data
Abstract: Every day, scientists around the world run BLAST against NCBI’s non-redundant (NR) database to identify a protein sequence’s taxonomic origin and functional annotation, often without a clear understanding of the database’s contents, because of its size and exponential growth. New tools are needed to explore the contents of large biological datasets and to better understand the assumptions and limitations of the data they contain. Boag is a domain-specific language and shared, Hadoop-based data science infrastructure for exploring genomic data. Protein sequence data, functional annotations, and taxonomic assignments from NCBI’s NR database were placed into a Boag database, along with a CD-HIT clustering of all of these protein sequences at different similarity levels.
We report the average length of the protein sequences found in NR, along with the most common taxonomic assignments and functional annotations. Using the CD-HIT clustering information, we show that the NR database contains a considerable amount of (annotation) redundancy at the 95% similarity level. These queries of the NR dataset ran quickly on the Boag infrastructure.
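As a rough illustration of the kinds of summaries described above, the following Python sketch computes a mean sequence length, the most common annotations, and one possible measure of annotation redundancy within clusters. It is not written in the Boag language and does not reflect the actual Boag queries. The record layout, the `summarize` function, the toy data, and the definition of redundancy used here (a cluster member is redundant if another member of the same cluster carries the same annotation) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Summarize (seq_id, length, annotation, cluster_id) tuples.

    Returns the mean sequence length, the three most common annotations,
    and the fraction of sequences whose annotation duplicates that of
    another member of the same cluster (an assumed redundancy measure).
    """
    records = list(records)
    mean_length = sum(length for _, length, _, _ in records) / len(records)
    top_annotations = Counter(a for _, _, a, _ in records).most_common(3)

    # Group annotations by cluster, e.g. clusters from a 95% similarity run.
    by_cluster = defaultdict(list)
    for _, _, annotation, cluster_id in records:
        by_cluster[cluster_id].append(annotation)

    # Within each cluster, every repeat of an annotation counts as redundant.
    redundant = sum(len(a) - len(set(a)) for a in by_cluster.values())
    return mean_length, top_annotations, redundant / len(records)


# Hypothetical toy records; not taken from NR.
toy = [
    ("p1", 300, "kinase", "c1"),
    ("p2", 310, "kinase", "c1"),
    ("p3", 305, "hypothetical protein", "c1"),
    ("p4", 120, "kinase", "c2"),
]
mean_length, top_annotations, redundancy = summarize(toy)
```

In this toy input, only the second `kinase` entry in cluster `c1` counts as redundant; other definitions of redundancy are equally plausible and would give different numbers.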
Committee: Hridesh Rajan, James Reecy, Samik Basu, David Fernández-Baca, and Xiaoqiu Huang