The classification community has been overly focused on predictive
accuracy. For many data mining tasks, understanding a classification rule is
as important as the accuracy of the rule itself. Going beyond predictive
accuracy to gain an understanding of the role different variables play in
building the classifier provides an analyst with a deeper understanding of
the processes leading to cluster structure. Ultimately this is our
scientific goal: to solve a problem and understand the solution. With a
deeper level of understanding researchers can more effectively pursue
screening, preventive treatments and solutions to problems.
In the machine learning community, which is driving much of the current
research into classification, the goal is that the computer operates
independently to obtain the best solution. In data analysis, we're still a
long way off this goal. Most algorithms require a human user to twiddle
many parameters in order to arrive at a satisfactory solution. The
human analyst is invaluable at the training phase of building a classifier.
The training stage can be laborious and time-intensive. In practice,
though, once a classifier is built the rule may be used frequently and on
large volumes of data, making it important to automate the process. The
training stage can therefore be viewed as a once-only process, one in which
it is worthwhile to take time, tweak parameters, plot and meticulously probe
the data, and allow the analyst to develop a rule peculiar to the problem at
hand.
Understanding the classifier in relation to the data at hand requires an
analyst to build a close relationship between the method and an example data
set, to determine how the method is operating on the data set, and to tweak
it as necessary to obtain better results for this problem. There is a tension
between future performance and performance on the existing sample that needs
to be balanced. Often the existing data set is divided into training and
test samples, to simulate the performance on future data. Statisticians
think in terms of a sample and population, with the population being
described by a probability distribution, and the sample being an instance of
the population. Statisticians are cautious about tailoring a solution too
closely to one sample when the goal is to describe an aspect of the
population. Nevertheless, probability functions are often inadequate in
describing the complexities of structure in high-dimensional data and the
analyst resorts to methods such as cross-validation and bootstrapping to
provide inferential statements with some measure of uncertainty. These
approaches are more extensive than a once-off division of a sample into
training and test sets. Regardless of the way that the analyst approaches
the future performance of a classifier, it is interesting to understand the
class structure in the existing sample.
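As a rough illustration of the difference between a single training/test split and the resampling approaches mentioned above, the following sketch compares hold-out, k-fold cross-validation and bootstrap estimates of accuracy. Everything in it is an assumption made for illustration: the data are synthetic, the classifier is a simple nearest-centroid rule rather than the method studied here, and all function names are hypothetical.

```python
import random
import statistics

def make_data(n_per_class=50, seed=0):
    """Two synthetic Gaussian classes in 2-D (illustrative data only)."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            x = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((x, label))
    rng.shuffle(data)
    return data

def fit_centroids(train):
    """Stand-in classifier: store the mean vector of each class."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(statistics.fmean(c) for c in zip(*pts))
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])))

def accuracy(centroids, test):
    return sum(predict(centroids, x) == y for x, y in test) / len(test)

def holdout(data, test_fraction=0.3):
    """One once-off split: a single accuracy figure, no spread."""
    cut = int(len(data) * (1 - test_fraction))
    return accuracy(fit_centroids(data[:cut]), data[cut:])

def cross_validate(data, k=5):
    """k-fold cross-validation: every case is tested exactly once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_centroids(train), folds[i]))
    return scores

def bootstrap(data, n_rounds=200, seed=1):
    """Train on a resample drawn with replacement, test on the left-out cases."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        train = [data[i] for i in idx]
        out_of_bag = [data[i] for i in range(n) if i not in chosen]
        # Skip the rare resample that has no left-out cases or lacks a class.
        if out_of_bag and len({y for _, y in train}) == 2:
            scores.append(accuracy(fit_centroids(train), out_of_bag))
    return scores

if __name__ == "__main__":
    data = make_data()
    cv = cross_validate(data)
    bs = bootstrap(data)
    print("hold-out accuracy:", holdout(data))
    print("5-fold CV mean and sd:", statistics.fmean(cv), statistics.stdev(cv))
    print("bootstrap mean and sd:", statistics.fmean(bs), statistics.stdev(bs))
```

The hold-out estimate is a single number, whereas the cross-validation and bootstrap runs each return a collection of scores whose spread gives some measure of uncertainty.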
The nature of class structure can be teased out with exploratory
statistical graphics. Visualization of the data in the training stage of
building a classifier can provide guidance on variable importance and
parameter selection.
In this project we use graphics to build a better classifier based on
support vector machines (SVM). We plot classification boundaries and other
key aspects of the SVM solution in high-dimensional spaces. The visual
tools are based on manipulating projections of the data, and are generally
described as tour methods.
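To make the idea of manipulating projections concrete, here is a minimal sketch of the basic ingredients: an orthonormal 2-D projection frame in p dimensions, the projection of the data onto it, and a sequence of intermediate frames between two views. It is an assumption-laden illustration, not the software used in this project: the function names are hypothetical, and the interpolation simply blends two frames linearly and re-orthonormalises them, which is a simplification of the smoother interpolation a real tour implementation would use.

```python
import math
import random

def orthonormalise(frame):
    """Gram-Schmidt: turn a list of p-dimensional vectors into an orthonormal set."""
    basis = []
    for v in frame:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

def random_frame(p, d=2, rng=random):
    """Random d-dimensional projection frame in p-dimensional space."""
    return orthonormalise([[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)])

def project(points, frame):
    """Coordinates of each p-dimensional point in the projection plane."""
    return [tuple(sum(xi * bi for xi, bi in zip(x, b)) for b in frame)
            for x in points]

def interpolate(frame_a, frame_b, t):
    """Simplified in-between frame for 0 <= t <= 1 (assumption: linear blend,
    then re-orthonormalise; degenerate if the blended vectors become collinear)."""
    blended = [[(1 - t) * a + t * b for a, b in zip(va, vb)]
               for va, vb in zip(frame_a, frame_b)]
    return orthonormalise(blended)

if __name__ == "__main__":
    rng = random.Random(0)
    # Ten synthetic points in five dimensions (illustrative data only).
    points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(10)]
    start = random_frame(5, rng=rng)
    end = random_frame(5, rng=rng)
    for step in range(5):
        frame = interpolate(start, end, step / 4)
        view = project(points, frame)  # 2-D coordinates for one scatterplot
        print(step, view[0])
```

Each `view` is one 2-D scatterplot of the data; showing the views in sequence gives the impression of the data rotating smoothly from one projection to the next.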