Quality-Aware Annotations with a Limited Budget
Hiring domain experts to annotate a corpus is costly in terms of money and time. As an alternative to domain experts, crowd workers or readily available pre-trained language models can provide annotations at considerably lower cost and in less time. However, crowd workers are non-experts, so their annotations are noisier than those of experts; moreover, the annotation task may be complex and demand knowledge the workers do not have. Pre-trained language models can likewise produce noisy annotations, and because such models often share similar architectures or training data, the annotations they provide can be correlated.
To tackle these challenges, we propose to recover the true information from the conflicting and noisy information provided by heterogeneous sources by modeling source reliability and the complex dependencies in the data during aggregation. Intuitively, an inferred aggregation result is likely to be accurate if it is supported by many sources and is consistent with the data dependencies, and a source that provides many truths is more reliable. Existing work on annotation aggregation assumes that workers are independent and that instances are independent, and thus cannot handle complex data dependencies or worker correlations. In this dissertation, we design truth discovery approaches that handle the aggregation challenges of complex tasks such as sequence labeling, constituency parsing, and dependency parsing. We propose effective optimization-based strategies as well as strategies built on probabilistic models, and evaluate their performance on real-world applications.
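The intuition above (truths are supported by many sources; sources that provide many truths are more reliable) corresponds to a classic iterative truth discovery loop. Below is a minimal sketch in Python, assuming simple categorical labels, a toy dictionary input format, and a smoothed-accuracy reliability weight; the function name and inputs are illustrative, and the dissertation's actual methods for sequence labeling and parse trees additionally model data dependencies and worker correlations, which this sketch does not.

import numpy as np

def truth_discovery(annotations, n_labels, n_iters=20):
    """Alternate between inferring truths by reliability-weighted voting
    and re-estimating each worker's reliability from agreement with the
    inferred truths.

    annotations: dict mapping (worker, instance) -> label index.
    Returns (estimated labels, per-worker reliability weights).
    """
    workers = sorted({w for w, _ in annotations})
    instances = sorted({i for _, i in annotations})
    weights = {w: 1.0 for w in workers}   # start with equal reliability
    truths = {}

    for _ in range(n_iters):
        # Infer each instance's label by reliability-weighted voting.
        for i in instances:
            votes = np.zeros(n_labels)
            for (w, j), y in annotations.items():
                if j == i:
                    votes[y] += weights[w]
            truths[i] = int(votes.argmax())
        # A worker who agrees with many inferred truths is more reliable.
        for w in workers:
            answered = [(j, y) for (ww, j), y in annotations.items() if ww == w]
            agree = sum(truths[j] == y for j, y in answered)
            weights[w] = (agree + 1) / (len(answered) + 2)  # smoothed accuracy
    return truths, weights

# Toy usage: three workers label two instances with labels in {0, 1}.
ann = {("w1", "x1"): 1, ("w2", "x1"): 1, ("w3", "x1"): 0,
       ("w1", "x2"): 0, ("w2", "x2"): 0, ("w3", "x2"): 1}
print(truth_discovery(ann, n_labels=2))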
Truth discovery approaches are employed as a post-processing step and require at least one annotation for every instance in the corpus. When the budget is insufficient to obtain annotations for all instances and no pre-trained language model is available for the task, we must design better mechanisms for obtaining annotations from crowd workers. In particular, the unlabeled instances in the corpus can exhibit label dependencies, and exploiting these dependencies during crowd annotation can improve labeling quality within the given budget. This dissertation proposes two optimization-based strategies that exploit label dependencies over a graph of instances to achieve good classification performance within the given budget. The first strategy assumes that all connected instances in the graph share the same label dependencies; the second relaxes this assumption and estimates the label dependencies between connected instances. The proposed methods help requesters spend the provided budget wisely by choosing influential instances for crowd annotation.
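As a rough illustration of the budgeted setting, the sketch below selects instances to send to the crowd and then propagates their labels through the instance graph. It assumes, for illustration only, a homophily-style dependency (connected instances tend to share a label) and uses a simple degree heuristic in place of the dissertation's optimization-based criterion for influential instances; the function name and input format are hypothetical.

from collections import deque

def select_and_propagate(adj, budget, get_crowd_label):
    """Greedy budgeted selection on an instance graph, then label propagation.

    adj: dict mapping each instance to an iterable of neighboring instances.
    budget: number of instances we can afford to send to the crowd.
    get_crowd_label: callable returning an aggregated crowd label for an instance.
    """
    # Pick the most connected instances first, as a stand-in for "influential".
    chosen = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:budget]
    labels = {v: get_crowd_label(v) for v in chosen}

    # Propagate labels outward from the annotated seeds (BFS).
    queue = deque(chosen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in labels:
                labels[v] = labels[u]  # same-label dependency assumption
                queue.append(v)
    return labels

A real selection strategy would trade off how much of the graph each candidate instance can influence against the expected noise in its crowd label, which is where the optimization-based formulations come in.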