PhD Preliminary Oral Exam: Samuel Fanijo
scCoCoLM: Single Cell Context-Conditioned Language Model for Cell-Type Annotation
Biology-specific foundation models for single-cell annotation require extensive pretraining runs and often degrade under cross-study shifts. We introduce scCoCoLM, a scalable, context-conditioned language model for single-cell type annotation. scCoCoLM uses a Querying Transformer (QFormer) with cross-attention to summarize a single-cell expression matrix into compact context representations, and a pathway-aware functional influence stream to inject control signals derived from biological pathways (Reactome). Together, these modules condition a general-purpose language model for cell-type annotation and yield strong cross-study transfer without requiring biology-specific pre-training, substantially reducing computational requirements relative to baselines. Evaluated on three public datasets (Aorta, PBMC, and hPancreas), scCoCoLM achieves an average F1 score of 94.7%, outperforming state-of-the-art baselines, including scGPT (82.8%) and Geneformer (9.9%). All results are obtained by fine-tuning for approximately 10 GPU hours on a single consumer RTX GPU, without biology-specific pre-training; by comparison, scGPT was pre-trained on 33 million cells for 192 hours across 16 A100 HPC GPUs, and Geneformer on 104 million cells for over 90 hours across 64 A100 HPC GPUs. Our results demonstrate that a compact, biological-context-conditioned bridge to a general-purpose language model can deliver state-of-the-art cross-study annotation at a fraction of the compute used by large biologically pre-trained foundation models, without losing biological interpretability.
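To make the architecture concrete, below is a minimal sketch of the conditioning bridge described above: learnable query tokens cross-attend over per-cell gene-expression embeddings (the QFormer), a pathway stream injects a Reactome-derived control signal, and the resulting context tokens are projected into a language model's embedding space as a soft prefix. All module names, dimensions, the additive fusion, and the prefix-conditioning scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a QFormer-style bridge for conditioning a general-purpose
# LM on single-cell context. Sizes and fusion choices are assumptions.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    def __init__(self, d_model=256, n_queries=32, n_heads=8, n_pathways=2000):
        super().__init__()
        # Learnable query tokens that summarize the cell's expression profile.
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Hypothetical pathway stream: maps per-cell pathway activity scores
        # (e.g., aggregated over Reactome gene sets) to a control vector.
        self.pathway_proj = nn.Linear(n_pathways, d_model)

    def forward(self, gene_tokens, pathway_scores):
        # gene_tokens: (B, n_genes, d_model) embedded expression values
        # pathway_scores: (B, n_pathways) per-cell pathway activities
        B = gene_tokens.size(0)
        q = self.queries.expand(B, -1, -1)
        # Queries attend over gene tokens to build compact context tokens.
        ctx, _ = self.cross_attn(q, gene_tokens, gene_tokens)
        ctx = self.norm(ctx + self.ffn(ctx))
        # Inject the pathway control signal into every context token
        # (additive fusion is one plausible choice, assumed here).
        ctx = ctx + self.pathway_proj(pathway_scores).unsqueeze(1)
        return ctx  # (B, n_queries, d_model)


# Conditioning a (frozen) general-purpose LM: project context tokens into the
# LM's embedding space and prepend them as a soft prefix; only the bridge and
# a classification head would be fine-tuned.
bridge = QFormerBridge()
lm_dim = 768  # assumed LM hidden size
to_lm = nn.Linear(256, lm_dim)
cells = torch.randn(4, 1000, 256)        # 4 cells, 1000 gene tokens (toy data)
pathways = torch.rand(4, 2000)           # toy pathway activity scores
prefix = to_lm(bridge(cells, pathways))  # (4, 32, lm_dim) prefix embeddings
print(prefix.shape)
```

The key design point this sketch illustrates is that the number of trainable parameters is confined to the small bridge, which is consistent with fine-tuning on a single consumer GPU while the language model itself stays general-purpose.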
Committee: Julie Dickerson (co-major professor), Ali Jannesari (co-major professor), Carson Andorf, Wensheng Zhang, and Justin William Walley