MS Final Oral Exam: Mahdi Banisharifdehkordi
Graph Neural Network Architectures for Interpretable I/O Bottleneck Analysis in High-Performance Computing Systems
The increasing complexity of high-performance computing (HPC) storage systems presents significant challenges for I/O performance optimization. Manual diagnosis requires extensive domain expertise and does not scale to the millions of jobs executed on production HPC systems. Existing automated diagnosis approaches treat each job as an independent instance and therefore fail to leverage structural dependencies between workloads with similar I/O patterns. Platform-level monitoring provides system-wide statistics that may not represent the characteristics of an individual job, while clustering-based methods suffer from a statistical consensus problem: the pattern that characterizes a group can differ from the behavior of the individual jobs within it.
This thesis proposes a graph neural network approach for HPC I/O performance prediction and bottleneck diagnosis. The approach constructs similarity graphs from Darshan I/O profiling logs, where nodes represent individual jobs and weighted edges encode behavioral similarity. This representation enables Graph Attention Networks (GATs) and GraphSAGE models to learn from neighborhood information, leveraging contextual patterns from similar jobs during prediction. The methodology incorporates attention-based interpretability mechanisms to identify performance-critical factors and diagnose bottlenecks.
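The graph construction described above can be illustrated with a minimal sketch. The abstract does not state which features, similarity measure, or edge-selection rule the thesis uses, so everything here is an assumption: jobs are represented by small hypothetical feature vectors, similarity is cosine similarity, and each job is linked to its k most similar jobs. The function names `cosine_similarity` and `build_similarity_graph` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_graph(features, k=2):
    """Link each job to its k most similar jobs.

    Returns a dict mapping an undirected edge (i, j), with i < j,
    to its similarity weight. Assumption: k-nearest-neighbor edges
    with cosine weights; the thesis may use a different rule.
    """
    edges = {}
    n = len(features)
    for i in range(n):
        sims = [
            (cosine_similarity(features[i], features[j]), j)
            for j in range(n)
            if j != i
        ]
        # Highest similarity first; ties broken by lower job index.
        sims.sort(key=lambda t: (-t[0], t[1]))
        for weight, j in sims[:k]:
            edges[(min(i, j), max(i, j))] = weight
    return edges


# Hypothetical, already-normalized per-job feature vectors.
# Jobs 0 and 1 behave alike; jobs 2 and 3 behave alike.
jobs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]

graph = build_similarity_graph(jobs, k=1)
for (i, j), weight in sorted(graph.items()):
    print(f"job {i} -- job {j}: weight {weight:.3f}")
```

In this toy input the two pairs of similar jobs end up connected to each other, which is the kind of neighborhood a GAT or GraphSAGE layer would then aggregate over. A brute-force pairwise comparison like this is quadratic in the number of jobs, so a dataset on the scale mentioned in the abstract would presumably need an approximate nearest-neighbor method instead.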
We evaluate the proposed approach on one million Darshan logs from NERSC Cori spanning 40 months of production workloads. Results demonstrate improved prediction accuracy over state-of-the-art ensemble methods, with graph-based modeling uncovering structural dependencies invisible to feature-based approaches. Case studies on real scientific applications validate that the diagnostic insights lead to actionable I/O performance improvements.
Committee: Ali Jannesari (major professor), Yang Li, and Samik Basu