Location: Zoom
Towards Data Cleaning in Large Public Biological Databases
As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. As sequencing data accumulate in these online repositories, scientists can increasingly use multi-tiered data to better answer biological questions. One main challenge facing public biological repositories is the quality of their metadata. Most public databases have no methods for identifying errors in their metadata, which creates the potential for error propagation. Cleaning data at this scale requires scalable infrastructure and algorithms. In this thesis, we built a domain-specific language and large-scale infrastructure, called BoaG, to analyze the wealth of genomics data. We used BoaG's interface to reason about the provenance, frequencies, and quality of annotations.
The second part of the thesis focuses on cleaning public repositories at scale. Databases such as the non-redundant (NR) database rely on user input, so errors in the submitted metadata go undetected and can propagate. Previous research analyzed misclassification based on sequence similarity in a small subset of the NR database. To the best of our knowledge, the amount of taxonomic misclassification in the entire database has not been quantified. We proposed and developed an automatic approach to detect and remove suspicious taxonomic assignments and mispredicted functional annotations. We also addressed the widely used sequence clustering information of the public databases. Clusters have proven useful in functional annotation, family classification, systems biology, structural genomics, and phylogenetic analysis. We used CD-HIT to cluster NR sequences at different similarity levels, i.e., 95%, 90%, 85%, down to 65%.
To improve the data quality of the clusters, we removed anomalies and then provided a confidence score based on the lineage of all sequences within each cluster.
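As a rough illustration of a lineage-based cluster confidence score, the sketch below takes the taxonomic lineages of the sequences in one cluster and reports the majority taxon at a chosen rank together with the fraction of members that agree with it. This is only one plausible scoring scheme, assumed here for illustration; the function name `cluster_confidence`, the `rank_index` parameter, and the toy lineages are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def cluster_confidence(lineages, rank_index):
    """Return (majority_taxon, confidence) for one cluster.

    lineages   -- list of lineages, each a list of taxon names ordered
                  from the root downward (hypothetical input format)
    rank_index -- position in each lineage of the rank to compare

    Confidence is the fraction of sequences (among those that have an
    entry at rank_index) that agree with the most common taxon.
    This is an assumed scoring rule, not the thesis's definition.
    """
    taxa = [lin[rank_index] for lin in lineages if len(lin) > rank_index]
    if not taxa:
        return None, 0.0
    taxon, count = Counter(taxa).most_common(1)[0]
    return taxon, count / len(taxa)


# Toy cluster: three sequences agree, one disagrees (illustrative data only).
cluster = [
    ["Bacteria", "Proteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Escherichia"],
    ["Bacteria", "Firmicutes", "Bacillus"],
]
taxon, score = cluster_confidence(cluster, rank_index=2)
print(taxon, score)
```

Under this assumed rule, a low score marks a cluster whose members carry conflicting lineages, and the minority members become candidates for review.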
For the functional annotations, we used the Protein Ontology (PRO) and the Gene Ontology (GO), both knowledge-based graphs, to detect potentially mispredicted functions. Ontologies are widely used to express knowledge; in this thesis, we leveraged them to improve the quality of public genomics databases. We proposed a computational method that abstracts ontology graphs into a lower-dimensional network representation, which makes it easier to reason about inconsistencies within a list of functional annotations.
We found that the BoaG infrastructure required fewer lines of code, reduced storage size, and provided automatic parallelization for large-scale analyses on the NR dataset. BoaG's web interface is also implemented and publicly available for researchers to test different hypotheses and share them with others. We identified 29,175,336 proteins in the NR database that have more than one distinct taxonomic assignment, among which 2,238,230 (7.6%) are potentially taxonomically misclassified. We also found that the number of potential misclassifications above the genus level in clusters at 95% similarity is 3,689,089 out of 88M clusters, about 4% of all clusters. This rate of misclassification in NR matters because the errors can propagate into downstream analyses. The method proposed in this thesis will be a valuable tool for cleaning up large-scale public databases, and the technique could be extended to address other kinds of annotation errors in public databases at scale.
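To make the notion of a conflict "above the genus level" concrete, the following sketch checks whether the taxonomic assignments attached to one protein agree at least down to the family rank. It is a minimal illustration under stated assumptions: the seven-rank `RANKS` list, the helper names `lowest_common_rank` and `conflicts_above_genus`, and the example lineages are all hypothetical, and the thesis's actual detection approach may differ.

```python
# Assumed fixed rank order; every lineage must supply one name per rank.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def lowest_common_rank(lineages):
    """Return the deepest rank at which all lineages agree, or None."""
    shared = None
    for i, rank in enumerate(RANKS):
        names = {lin[i] for lin in lineages}
        if len(names) != 1:
            break  # lineages diverge at this rank
        shared = rank
    return shared


def conflicts_above_genus(lineages):
    """True if the lineages do not all share the same family (or higher)."""
    shared = lowest_common_rank(lineages)
    return shared is None or RANKS.index(shared) < RANKS.index("family")


# Illustrative lineages only.
ecoli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae",
         "Escherichia", "Escherichia coli"]
shigella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
            "Enterobacterales", "Enterobacteriaceae",
            "Shigella", "Shigella flexneri"]
human = ["Eukaryota", "Chordata", "Mammalia", "Primates",
         "Hominidae", "Homo", "Homo sapiens"]

print(conflicts_above_genus([ecoli, shigella]))  # differ only at genus
print(conflicts_above_genus([ecoli, human]))     # differ at superkingdom
```

In this sketch, assignments that differ only at the genus or species rank are tolerated, while a protein whose assignments diverge at the family rank or higher is flagged as potentially misclassified.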
Committee: Hridesh Rajan (major professor), Samik Basu, Xiaoqiu Huang, David Fernandez-Baca, and James Reecy