Computer Science PhD student Harris Lin received the best student paper award at the 2013 IEEE Big Data Congress for his paper "Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores," which he coauthored with his advisor, Professor Vasant Honavar.
The emergence of linked open data (LOD), a simple, powerful, and practical means to publish structured data (in the form of subject-predicate-object triples, enriched with metadata in the form of RDF Schema, which in turn can use namespaces whose semantics are provided by expressive OWL ontologies) and, more importantly, to link the data, offers unprecedented opportunities for automated knowledge acquisition in many areas of human endeavor. In short, LOD allows disparate structured data sets to be linked in much the same way that typically unstructured documents are linked via hyperlinks on the Web. The LOD collection currently includes several hundred linked data sets that together contain many billions of RDF triples, covering domains as diverse as government, life sciences, health sciences, agricultural sciences, social media, business, and commerce. The power and utility of LOD can only be expected to increase as additional data sets become available and new links between the data sets are established. Major open data initiatives currently underway around the world, together with the development of more effective and easy-to-use tools for publishing, managing, querying, and linking large amounts of data, can be expected to lead to exponential growth in LOD, following a pattern similar to the growth of the Web. However, fully realizing the potential of LOD in discovery and decision making calls for effective methods for building predictive models from LOD or extracting useful knowledge from it.
While machine learning approaches currently offer the most cost-effective means of constructing predictive models from data, the applicability of current approaches is severely limited in the LOD setting. It is often not feasible to retrieve large amounts of LOD for analysis because of access, memory, bandwidth, and computational restrictions and, in some instances, privacy and confidentiality considerations. Harris Lin's previous research on learning predictive models from RDF data, published in 2011 at the International Semantic Web Conference, was the first attempt of its kind. It provided a proof of concept for learning rather simple predictive models, namely relational Naive Bayes classifiers, from RDF data in a setting where the data are located at a single remote RDF data store that can be accessed only through a query interface that answers queries posed in the RDF query language SPARQL.
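The idea of learning through a query interface rather than from downloaded data can be illustrated with a minimal sketch. It is not the algorithm from the 2011 paper: the tiny in-memory triple set, the predicate names `likesGenre` and `label`, and the helper `count_subjects` are all hypothetical stand-ins for a remote store, and the model is a simplified Naive Bayes with add-one smoothing. What matters is that the learner only ever sees counts, never the triples themselves.

```python
import math

# Hypothetical toy data standing in for a remote RDF store.
# The learner below never reads this set directly; it only calls count_subjects.
TRIPLES = {
    ("u1", "likesGenre", "jazz"), ("u1", "likesGenre", "blues"), ("u1", "label", "premium"),
    ("u2", "likesGenre", "jazz"), ("u2", "label", "premium"),
    ("u3", "likesGenre", "pop"), ("u3", "label", "basic"),
    ("u4", "likesGenre", "pop"), ("u4", "likesGenre", "rock"), ("u4", "label", "basic"),
}


def count_subjects(*patterns):
    """Count distinct subjects that match every (predicate, object) pattern.

    Against a real store this would be a SPARQL aggregate query such as
    SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s :likesGenre "jazz" . ?s :label "premium" }
    (the prefix and predicate names are assumptions made for illustration).
    """
    subject_sets = [{s for (s, p, o) in TRIPLES if (p, o) == pat} for pat in patterns]
    return len(set.intersection(*subject_sets))


def train(classes, values):
    """Estimate class priors and smoothed likelihoods from count queries only."""
    class_counts = {c: count_subjects(("label", c)) for c in classes}
    total = sum(class_counts.values())
    model = {}
    for c in classes:
        prior = class_counts[c] / total
        # Add-one smoothing on the presence of each attribute value.
        likelihood = {
            v: (count_subjects(("likesGenre", v), ("label", c)) + 1) / (class_counts[c] + 2)
            for v in values
        }
        model[c] = (prior, likelihood)
    return model


def predict(model, observed_values):
    """Return the class with the highest log-posterior score for the observed values."""
    def score(c):
        prior, likelihood = model[c]
        return math.log(prior) + sum(math.log(likelihood[v]) for v in observed_values)
    return max(model, key=score)


MODEL = train(["premium", "basic"], ["jazz", "blues", "pop", "rock"])
```

Calling `predict(MODEL, ["jazz"])` scores each class using only the aggregated counts. In this toy setup training costs one count query per class plus one per value and class pair, which is the kind of communication a query-only interface permits.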
The IEEE Big Data Congress paper builds on this work to consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, the paper (i) introduces statistical query based formulations of several representative algorithms for learning classifiers from RDF data; (ii) introduces a distributed learning framework for learning classifiers from multiple interlinked RDF stores that form a chain; (iii) identifies three special cases of links between RDF data sets in LOD and describes effective strategies for learning predictive models in each case; (iv) for settings where the statistics needed for learning cannot be obtained without centralized access to the entire linked data set, examines a novel application of a matrix reconstruction technique from the field of computerized tomography, which approximates the statistics needed by the learning algorithm from projections obtained using count queries and thus dramatically reduces the amount of information transmitted from the remote data sources to the learner; and (v) reports results of experiments with a real-world social network data set, which demonstrate that the proposed approach yields results competitive with those obtained when the learner has direct, in-memory access to the entire linked data set, but at a significantly lower communication cost. The kinds of links between RDF data sets considered in this paper are relatively simple, albeit interesting, special cases.
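Approximating a table of joint counts from its projections can be sketched as follows. This article does not say which reconstruction technique the paper uses, so the sketch substitutes the additive algebraic reconstruction technique (ART), a classic tomography method, purely as an illustration. The function name `art_reconstruct` and the example counts are hypothetical. Row sums and column sums play the role of projections that each data source could answer with count queries.

```python
def art_reconstruct(row_sums, col_sums, sweeps=20):
    """Estimate an m-by-n matrix from its row and column sums with additive ART.

    Each sweep visits every row and column "ray" and spreads that ray's
    residual (target sum minus current sum) evenly over the cells on it.
    Assumes sum(row_sums) == sum(col_sums). Entries of the estimate may be
    fractional or even negative, since only the sums are constrained.
    """
    m, n = len(row_sums), len(col_sums)
    x = [[0.0] * n for _ in range(m)]
    for _ in range(sweeps):
        # Row projections.
        for i in range(m):
            residual = row_sums[i] - sum(x[i])
            for j in range(n):
                x[i][j] += residual / n
        # Column projections.
        for j in range(n):
            residual = col_sums[j] - sum(x[i][j] for i in range(m))
            for i in range(m):
                x[i][j] += residual / m
    return x


# Hypothetical projections: counts per attribute value and counts per class label.
ROW_SUMS = [4, 4]
COL_SUMS = [3, 5]
ESTIMATE = art_reconstruct(ROW_SUMS, COL_SUMS)
```

The learner in this sketch receives only m + n numbers instead of all m × n joint counts, which is where the saving in communication comes from. The price is that the estimate is only one of many matrices consistent with the projections, so the statistics it supplies are approximate.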
Work in progress aims to extend these results to settings where LOD consists of multiple RDF stores connected through more complex linkage patterns, under different assumptions regarding the nature of the links, different constraints on access and processing, and different operations supported by the data sources. It also aims to apply the resulting algorithms to extract useful knowledge from LOD in several applications drawn from Computational Biology, Health Informatics, and Social Network Analytics, among other areas.
This research was funded in part by NSF grant IIS 0711356 and by the Iowa State University Center for Computational Intelligence, Learning, and Discovery.