In many subject domains, it is important to know the prevalence of each class in a population and to track it over time, for example when refining customer complaint issues or gauging support for political candidates in a micro-targeted campaign. Quantification learning is a supervised learning task that focuses on predictions for an unseen dataset at the aggregate level rather than the individual level. However, quantification learning is an under-explored research area in computer science because it is often seen as a trivial post-processing step of classification, as in Classify and Count (CC) methods. In this paper, we propose a synthetic data augmentation method that uses external sources to improve the CC method, and an end-to-end deep quantification learning framework called Deep Quantification Network (DQN) that directly estimates the class distribution of a test set without classifying individual documents. We studied DQN with three different strategies for generating tuplets (sets of documents) and performed sensitivity analyses of DQN when varying the number of documents in a tuplet and when changing the distribution of training samples in the training process. We evaluated DQN on three tasks (two multi-class quantification tasks and one binary quantification task). DQN reduces the quantification error by 18.6% to 50.5% compared to the best results achieved by the state-of-the-art methods. In addition, with 50% fewer training samples, the DQN model is still as effective as or more effective than the state-of-the-art methods.
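To make the Classify and Count baseline mentioned above concrete, the following is a minimal sketch in Python. It illustrates only the general CC idea (label every document, then report the share of each label) and is not the implementation evaluated in this work. The function names, the keyword-based `toy_classifier`, and the sample documents are all hypothetical and stand in for a trained classifier and a real test set.

```python
from collections import Counter


def classify_and_count(documents, classify, classes):
    """Estimate class prevalence by labeling each document and counting labels.

    `classify` is any callable that maps one document to one class label.
    """
    if not documents:
        raise ValueError("documents must not be empty")
    counts = Counter(classify(doc) for doc in documents)
    total = len(documents)
    # Classes that were never predicted get a prevalence of 0.0.
    return {label: counts[label] / total for label in classes}


def toy_classifier(doc):
    """Hypothetical stand-in for a trained classifier (assumption, not from the source)."""
    return "complaint" if "refund" in doc.lower() else "other"


# Hypothetical test set used only for illustration.
sample_docs = [
    "I want a refund now",
    "Great service",
    "Refund please",
    "Thanks for the help",
]

prevalence = classify_and_count(sample_docs, toy_classifier, ["complaint", "other"])
print(prevalence)
```

Because CC aggregates individual predictions, any systematic bias in the classifier carries over directly into the prevalence estimate, which is the limitation that motivates estimating the class distribution directly from a set of documents.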
Committee: Wallapak Tavanapong (major professor), Johnny Wong, Adisak Sukul, Jin Tian, Dave Peterson