Information Extraction with Weak Supervision
Information Extraction (IE) is an important task in the Natural Language Processing (NLP) domain that aims to automatically extract structured information from unstructured text. IE encompasses several fundamental NLP tasks, such as Named Entity Recognition (NER), Relation Extraction (RE), and Entity Linking (EL), which play crucial roles in applications such as knowledge graph construction. While advanced deep learning techniques have brought significant progress on these tasks, deep learning methods usually require large amounts of training data with high-quality human annotations, which are costly to obtain. To reduce the cost of human effort, recent studies have explored alternative approaches with weak supervision for IE tasks. Weak supervision refers to a machine learning paradigm in which the training data is incompletely or imperfectly labeled using heuristics, rules, or other indirect sources. Because such labels are noisy and incomplete, IE models trained with weak supervision typically suffer from inferior performance.

In our work, we specifically study the NER and RE tasks with distant supervision, a type of weak supervision in which noisy labels are obtained from external dictionaries and/or knowledge bases, and aim to achieve high performance on both tasks. For the Distantly Supervised NER (DSNER) problem, we formulate the task via Multi-class Positive and Unlabeled (MPU) learning and propose a theoretically and practically novel CONFidence-based MPU (Conf-MPU) approach. For the Distantly Supervised RE (DSRE) problem, we propose a novel DSRE-NLI framework, which considers both distant supervision from existing knowledge bases and indirect supervision from pretrained language models trained for other tasks. Extensive experiments demonstrate the superiority of the proposed methods over previous baselines on both tasks.
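To make the distant supervision setting concrete, the following is a minimal sketch (not the thesis's implementation) of dictionary-based distant labeling for NER: known entity surface forms are greedily matched against text to produce noisy token-level BIO labels. The dictionary contents and function name are illustrative assumptions. Entities absent from the dictionary are left as "O" (unlabeled), which is exactly the incomplete-label problem that Positive-Unlabeled formulations such as Conf-MPU are designed to handle.

```python
# Hypothetical dictionary mapping entity surface forms to entity types.
ENTITY_DICT = {
    "new york": "LOC",
    "barack obama": "PER",
}

def distant_label(tokens):
    """Greedy longest-match dictionary tagging; returns noisy BIO labels."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span starting at position i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j]).lower()
            if span in ENTITY_DICT:
                etype = ENTITY_DICT[span]
                labels[i] = "B-" + etype
                for k in range(i + 1, j):
                    labels[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return labels

tokens = "Barack Obama visited Albany and New York".split()
print(distant_label(tokens))
# "Albany" is a location but is missing from the dictionary, so it is
# (wrongly) left unlabeled as "O":
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
```

Treating such unmatched tokens as true negatives is what degrades standard supervised training; PU learning instead treats them as unlabeled.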
Committee: Qi Li (major professor), Hongyang Gao, Zhu Zhang, Ying Cai, Jia Liu