Dependable Data-Driven Discovery Institute

Rajan, Aduri, Nettleton, Weber and Hegde will lead $1.5 Million NSF HDR TRIPODS Phase I Institute Grant

Data science is an important interdisciplinary field that significantly affects many aspects of the modern world, including government, industry, academia and the general public. Transdisciplinary Research in Principles of Data Science – or TRIPODS – brings together computer science, statistics, mathematics and related communities to develop theoretical foundations of data science through integrated research and training activities focused on core algorithmic, mathematical and statistical principles.
Since 2017, an interdisciplinary group of about 30 Iowa State faculty members from across campus has met on a recurring basis to delve deeper into a range of data science issues. Building on the momentum of these internal meetings, the group, led by Hridesh Rajan, has received a three-year $1.5 million award from the National Science Foundation (NSF) to work toward establishing a D4 (Dependable Data-Driven Discovery) Institute at Iowa State. The co-investigators are Pavan Aduri, Chinmay Hegde, Daniel Nettleton, and Eric Weber. Senior personnel are Michael Catanzaro, Jia (Kevin) Liu, Henry Schenck, Vinodchandran Variyam, Namrata Vaswani, Lily Wang, and Zhengyuan Zhu. Iowa State will facilitate transdisciplinary collaboration in dependable data-driven discovery, establish and sustain cross-sector collaborations, create a hub for sharing data science expertise, and educate researchers and practitioners in theoretical, applied and ELSEI (ethical, legal, social, and economic impacts) aspects of data-driven discovery. Efforts to enhance the dependability of data-driven discovery are critical because unreliable discoveries can have far-reaching and even catastrophic impacts on science and engineering initiatives and on society as a whole.

The activities of the D4 Institute will have a transformative impact on the dependability of data-science lifecycles. First, defining the problem will itself have a significant impact by supporting future innovations beyond academia. While the notion of dependability is well studied in the computer-systems literature, challenges in data science push the boundary of existing knowledge into the unknown. The institute's work will define D4 and increase data science's benefit to society by providing a transformative theory of D4. Second, the institute will facilitate the development of a shared vocabulary, which will encourage experts across the TRIPODS disciplines and domain experts to collaborate on common goals and challenges. Third, the institute will set research directions for D4 by providing funding for foundational research, which will have a separate set of impacts. Fourth, the institute will facilitate transdisciplinary training of a diverse cadre of data scientists through activities such as the Midwest Big Data Summer School and the D4 workshop.

The project will advance the theoretical foundations of data science by fostering foundational research that will enable understanding of the risks to the dependability of data-science lifecycles, formalize a rigorous mathematical basis for measures of dependability of data-science lifecycles, and identify mechanisms to create dependable data-science lifecycles. The project defines a risk as a cause that can lead to failures in data-driven discovery, and it refers to the processes that plan for, acquire, manage, analyze, and infer from data collectively as the data-science lifecycle. For instance, an inference procedure that is very expensive can deliver information too late to a human operator facing a deadline (complexity as a risk); if the data-science lifecycle provides a recommendation without an accompanying uncertainty measure, a human operator has no means to determine whether to trust the recommendation (uncertainty as a risk). In contrast to recent work that has focused on fairness, accountability, and trustworthiness issues for machine learning algorithms, this project will take a holistic perspective and consider the entire data-science lifecycle. In Phase I of the project, the investigators will focus on four measures: complexity, resource constraints, uncertainty, and data freshness. In developing a framework to study these measures, this work will prepare the investigators to scale up their activities to other measures in Phase II as well as to address larger portions of the data-science lifecycle. The study of each measure raises foundational challenges that will require expertise from multiple TRIPODS disciplines to address.