SVM Visualization



 

Project Summary

The classification community has been overly focused on predictive accuracy. For many data mining tasks, understanding a classification rule is as important as the accuracy of the rule itself. Going beyond predictive accuracy to understand the role that different variables play in building the classifier gives an analyst deeper insight into the processes leading to cluster structure. Ultimately, this is our scientific goal: to solve a problem and understand the solution. With a deeper level of understanding, researchers can more effectively pursue screening, preventive treatments, and solutions to problems.

In the machine learning community, which is driving much of the current research into classification, the goal is for the computer to operate independently to obtain the best solution. In data analysis, we are still a long way from this goal. Most algorithms require a human user to adjust many parameters in order to arrive at a satisfactory solution. The human analyst is invaluable in the training phase of building a classifier.

The training stage can be laborious and time-intensive. In practice, though, once a classifier is built the rule may be used frequently and on large volumes of data, making it important to automate its application. The training stage can therefore be viewed as a one-time process in which it is worthwhile to take time, tweak parameters, plot the data, probe the data meticulously, and allow the analyst to develop a rule specific to the problem at hand.

Understanding the classifier in relation to the data at hand requires an analyst to build a close relationship between the method and an example data set, to determine how the method is operating on the data set, and to tweak it as necessary to obtain better results for this problem. There is a tension between future performance and performance on the existing sample that needs to be balanced. Often the existing data set is divided into training and test samples to simulate performance on future data. Statisticians think in terms of a sample and a population, with the population being described by a probability distribution and the sample being an instance of the population. Statisticians are cautious about tailoring a solution too closely to one sample when the goal is to describe an aspect of the population. Nevertheless, probability functions are often inadequate for describing the complexities of structure in high-dimensional data, and the analyst resorts to methods such as cross-validation and bootstrapping to provide inferential statements with some measure of uncertainty. These approaches are more extensive than a one-off division of a sample into training and test sets. Regardless of how the analyst approaches the future performance of a classifier, it is interesting to understand the class structure in the existing sample.
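To make the contrast with a single training/test split concrete, the following is a minimal sketch of k-fold cross-validation and a bootstrap (out-of-bag) accuracy estimate. It is not the project's code. The nearest-centroid classifier stands in for an SVM only to keep the example self-contained, and the toy data and every function name (make_toy_data, k_fold_accuracy, bootstrap_accuracy, and so on) are assumptions made for illustration.

```python
import random


def nearest_centroid_fit(X, y):
    """Return {label: centroid}; a toy stand-in for a real classifier such as an SVM."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(model, row):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda lab: sq_dist(model[lab]))


def accuracy(model, X, y):
    hits = sum(nearest_centroid_predict(model, r) == t for r, t in zip(X, y))
    return hits / len(y)


def k_fold_accuracy(X, y, k=5, seed=0):
    """Hold out each of k folds in turn; return one accuracy per fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # interleaved slices of the shuffled indices
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return scores


def bootstrap_accuracy(X, y, n_rounds=50, seed=0):
    """Fit on a resample drawn with replacement; score on the cases left out of it."""
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(n_rounds):
        sample = [rng.randrange(n) for _ in range(n)]
        chosen = set(sample)
        out_of_bag = [i for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue  # nothing left to score on in this round
        model = nearest_centroid_fit([X[i] for i in sample], [y[i] for i in sample])
        scores.append(accuracy(model, [X[i] for i in out_of_bag], [y[i] for i in out_of_bag]))
    return scores


def make_toy_data(n_per_class=40, seed=1):
    """Two Gaussian classes in 2-D (made-up data, for illustration only)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in (("a", 0.0), ("b", 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
cv_scores = k_fold_accuracy(X, y)
boot_scores = bootstrap_accuracy(X, y)
print("5-fold accuracies:", cv_scores)
print("bootstrap accuracy range:", min(boot_scores), max(boot_scores))
```

The spread of the per-fold or per-resample scores gives a rough sense of the uncertainty that a single split cannot show.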

The nature of class structure can be teased out with exploratory statistical graphics. Visualizing the data during the training stage of building a classifier can provide guidance on variable importance and parameter selection.

In this project we used graphics to build a better classifier based on support vector machines (SVM). We plot classification boundaries and other key aspects of the SVM solution in high-dimensional spaces. The visual tools are based on manipulating projections of the data, and are generally described as tour methods.
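As a rough illustration of what viewing a classifier through projections can mean, the sketch below draws a few random two-dimensional projection bases for five-dimensional points and tags each projected point with the side of a linear decision boundary it falls on. It is not the project's software. The weight vector w and offset bias are hypothetical values rather than a trained SVM, and drawing independent random frames is a simplification: tour implementations typically interpolate smoothly between successive projections.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_frame(p, rng):
    """Return an orthonormal pair of p-vectors (a 2-D projection basis) via Gram-Schmidt."""
    while True:
        a = [rng.gauss(0.0, 1.0) for _ in range(p)]
        b = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm_a = math.sqrt(dot(a, a))
        if norm_a < 1e-12:
            continue  # degenerate draw, try again
        a = [x / norm_a for x in a]
        overlap = dot(a, b)
        b = [x - overlap * y for x, y in zip(b, a)]  # remove the component along a
        norm_b = math.sqrt(dot(b, b))
        if norm_b < 1e-12:
            continue
        b = [x / norm_b for x in b]
        return a, b


def project(points, frame):
    """Project each p-dimensional point onto the 2-D plane spanned by the frame."""
    a, b = frame
    return [(dot(pt, a), dot(pt, b)) for pt in points]


def side_of_boundary(w, bias, x):
    """Sign of the linear decision function w.x + bias: +1 or -1."""
    return 1 if dot(w, x) + bias >= 0 else -1


rng = random.Random(42)
p = 5
points = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(20)]

# Hypothetical linear decision function (illustrative values, not fitted to data).
w = [1.0, -0.5, 0.0, 0.25, 0.0]
bias = 0.1
sides = [side_of_boundary(w, bias, pt) for pt in points]

# A short sequence of projections; each view pairs 2-D coordinates with the class side.
frames = [random_frame(p, rng) for _ in range(3)]
views = [
    [(x, y, s) for (x, y), s in zip(project(points, frame), sides)]
    for frame in frames
]
for i, view in enumerate(views):
    print("view", i, "first point:", view[0])
```

Each view could then be handed to any plotting tool, with the side of the boundary mapped to color, so that the analyst can see in which projections the classes separate.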

Related Project

Limn

 

Acknowledgements: This research was supported by funding from the National Science Foundation (#9982341).





          Copyright 2005, Iowa State University
This page is maintained by Doina Caragea. Last updated: 01/21/06.