Semantics and Anomaly Preserving Sampling Strategy for Big Data
Abstract: In this research study, we motivate semantics-preserving sampling with three use cases: visualizing large-scale data, sending large-scale data over limited-bandwidth channels, and processing data generated by high-throughput sensors. Visualizing large-scale data for exploratory data analysis is a challenge because the data size quickly exceeds the capabilities of existing visualization tools. Sending large-scale data over limited-bandwidth channels can be prohibitive because data transfer becomes a bottleneck. Data generated by high-throughput sensors can overwhelm storage capacity: to fit the generated data within the available capacity, some data is dropped, or the sensor is turned on only periodically. Typically, sampling strategies are used for data reduction to overcome these hurdles. However, while sampling schemes have been designed to preserve certain statistical properties of the population, important peaks and anomalous behaviors are lost. We have conducted an experimental evaluation of semantics-preserving sampling, using trend line data as an example, and found that it has advantages over traditional data reduction techniques. Our evaluation on seven large datasets shows that the proposed technique performs well compared to existing approaches in improving visualization quality, as measured by image similarity metrics. Our user study on Amazon Mechanical Turk reveals that users prefer visualizations produced by the proposed approach over those produced by other techniques.
Committee Members: Hridesh Rajan (Major Professor), David Fernandez-Baca, Qi Li, Wallapak Tavanapong, and Pavankumar Aduri