Toward Generalized and Scalable Relation Extraction
Relation Extraction (RE) is a fundamental component of Information Extraction that aims to identify semantic relationships between named entities in text. While traditional RE methods rely on pre-defined relation schemas and annotated data, real-world scenarios often involve unstructured sentences, novel relation types, and limited labeled data. This thesis presents a series of models that collectively advance the scalability, generalization, and adaptability of RE systems.
We begin with CoRec, a lightweight coordination recognition model that identifies coordinators and conjunct boundaries without relying on syntactic parsers. As a general-purpose pre-processing tool, CoRec enhances downstream information extraction pipelines by simplifying complex sentence structures and significantly boosting extraction yield.
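To illustrate how recognized coordination boundaries can simplify a sentence, here is a minimal sketch. It is not CoRec itself: the conjunct spans are supplied by hand rather than predicted by a model, and the function name `split_coordination` and the span format (token offsets, end-exclusive) are assumptions made for this example.

```python
def split_coordination(tokens, conjunct_spans):
    """Rewrite one coordinated sentence as one simple sentence per conjunct.

    tokens: list of word strings.
    conjunct_spans: (start, end) token offsets, end-exclusive, one per conjunct.
    Hypothetical helper: a real system would predict these spans.
    """
    spans = sorted(conjunct_spans)
    coord_start = spans[0][0]   # first token of the first conjunct
    coord_end = spans[-1][1]    # one past the last token of the last conjunct
    prefix = tokens[:coord_start]
    suffix = tokens[coord_end:]
    # Keep the shared context and substitute a single conjunct each time;
    # the coordinator ("and", "or", ...) lies between conjuncts and is dropped.
    return [" ".join(prefix + tokens[s:e] + suffix) for s, e in spans]


tokens = "Alice and Bob founded Acme".split()
simple_sentences = split_coordination(tokens, [(0, 1), (2, 3)])
print(simple_sentences)
```

Each resulting sentence contains a single subject, which is the kind of input a downstream extractor handles more easily than the original coordinated form.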
Next, we propose AugURE, an unsupervised RE framework that enhances relation representation learning through both within-sentence pair augmentation and cross-sentence pair extraction. AugURE also overcomes the limitations of conventional Noise Contrastive Estimation (NCE) by adopting a margin-based loss. This approach improves clustering quality and enables more accurate unsupervised relation discovery.
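As a rough illustration of what a margin-based loss looks like, the sketch below implements a generic triplet-style hinge over cosine distance. The abstract does not specify the exact formulation used in AugURE, so the distance function, the default margin of 0.5, and the names `cosine_distance` and `margin_loss` are all assumptions for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def margin_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet hinge loss (an assumed form, not the thesis's own).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is; otherwise it grows linearly.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# A well-separated triplet incurs no loss; a reversed one is penalized.
good = margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(good, bad)
```

Unlike an NCE-style objective, a hinge of this kind stops pushing a negative pair apart once it clears the margin, so the objective only requires a relative ordering between positive and negative pairs.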
Finally, we address Open-world Relation Extraction with MixORE, a two-phase framework designed for corpora containing a mixture of known and novel relations. MixORE integrates novel relation detection with open-world semi-supervised joint learning. This design enables the model to adapt continuously to novel relations, and it consistently outperforms competitive baselines in both known relation classification and novel relation discovery on multiple benchmark datasets.
Together, these contributions provide practical solutions to key challenges in relation extraction, advancing the field toward more flexible and scalable systems.
Committee: Qi Li (major professor), Hongyang Gao, Ying Cai, Kevin Liu and Zhu Zhang