MS Final Oral Exam: Mohammed Musthafa Rafi
Benchmarking Tabular Foundation Models for Agricultural Yield Prediction
Accurate crop yield prediction is crucial for global food security and agricultural planning. This thesis benchmarks modern tabular foundation models and automated machine learning (AutoML) frameworks for agricultural yield prediction across three diverse datasets: US soybean yields with 86,101 temporal sequences, global multi-crop data with 28,242 samples across 101 countries, and EU-27 regional crops with 8,656 samples containing significant missing data. We evaluate TabPFNv2, a transformer-based tabular foundation model that is pre-trained on synthetic data and makes predictions through in-context learning, against AutoGluon and PyCaret, two state-of-the-art AutoML frameworks. Our results demonstrate that model performance is highly context-dependent: AutoGluon performs best on large-scale complete data, PyCaret excels in diverse multi-crop scenarios, and TabPFNv2 shows distinct advantages on datasets with missing values, achieving a 2.18 percentage point gain in R² on the EU-27 dataset. We also develop dataset-specific preprocessing pipelines that handle temporal aggregation, missing values, and feature engineering for agricultural data. These findings provide practical guidelines for model selection in agricultural AI: foundation models offer robust zero-shot predictions, particularly on incomplete data, while AutoML frameworks remain preferable when large, complete training sets are available.
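For readers unfamiliar with the metric, the following is a minimal sketch of how a gain in R² expressed in percentage points could be computed when comparing two models on the same held-out targets. The function names `r2_score` and `r2_gain_pp` and the toy numbers are illustrative assumptions, not the thesis's evaluation code or its data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_gain_pp(y_true, pred_a, pred_b):
    """Difference in R² between model A and model B, in percentage points."""
    return 100.0 * (r2_score(y_true, pred_a) - r2_score(y_true, pred_b))


# Toy values for illustration only (hypothetical, not results from the thesis).
y_true = [2.0, 3.0, 4.0, 5.0]   # observed yields
pred_a = [2.1, 2.9, 4.2, 4.8]   # predictions from a hypothetical model A
pred_b = [2.5, 2.5, 4.5, 4.5]   # predictions from a hypothetical model B

print(f"R² gain of A over B: {r2_gain_pp(y_true, pred_a, pred_b):.2f} percentage points")
```

A percentage point gain is an absolute difference between two R² values (for example, 0.80 versus 0.98), not a relative improvement of one over the other.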
Committee: Adarsh Krishnamurthy (major professor), Soumik Sarkar (major professor) and Baskar Ganapathysubramanian