An Analysis of Methods in
Feature Selection and Classification Algorithms for Spam Detection
(An Informal Guide to Spam Filtering)
Abstract from the formal paper:
The detection of spam email is a problem of growing significance to both corporate and public institutions. The problem is a classic case of two-class text classification and lends itself to many feature selection and machine learning classification algorithms. Optimal feature selection involves areas of text classification including using mutual information to determine key features, choosing an optimal number of features, and choosing an appropriate feature vector type. Once feature selection is complete, there are a wide variety of machine learning algorithms that can be used for classification. Among the classifiers most commonly implemented for this problem are Naïve Bayesian networks and a wide variety of boosting algorithms including AdaBoost.M1. This paper investigates the range and effectiveness of various methods of feature selection and classification algorithms and attempts to find the optimal combination of both.
Spam
If you've ever used electronic mail, chances are good that you've
encountered the malevolant creature known only as spam. Whether titled
"Free Timeshares!!!" or "Hot XXX Action", you can usually tell without even
looking at the message that it's spam. The problem is, how do you train a
computer to know whether a message is spam or not? At first glance, it
may seem simple - just search the message for certain words - like "XXX" or
"FREE MONEY" - and delete those. Unfortunately, spammers are much more
intelligent than that and once they understand your filter, will find ways
around it. For instance, what was once "XXX" and caught by the simple
filter, may metamorph into "--XxXX--" which is not caught by your filter.
Many early filters that worked on these naive principles are no longer
effective because spam is constantly changing. So, to counter it, we need
a filter that is constantly changing. Here we enter the fields of text
recognition and machine learning.
Text Classification
Text classification is a field that focuses on teaching machines how to
classify documents into classes. Your favorite search engine can do this
fairly well. Type in "eggplant" and a powerful machine learning algorithm
scans millions of documents and returns only those pertaining to
eggplants. Now, what if we apply this technology to classifying spam
emails? It turns out that many researchers have had a great deal of luck
using machine learning (ML) algorithms to detect spam. ML algorithms are
interesting because they can change the way they classify based on their
input. So, if a classifier stops working well after a period of time
(because the form of spam has changed), one merely needs to rebuild the
classifier using more recent emails and the ML will output a new classifier
that's much more effective. In this way, the filter can never be
outdated, and no matter how hard they try, spammers won't be able to get their
wares past our dutiful filter.
Feature Selection
Before the classifier is trained, we first need data to train it with. It
turns out that a classifier will work much better if we take the time to
analyze a bunch of average emails and determine which features
(words) will help the most in classification. These features are combined
into a feature vector for each message, which can then be used to train a
classifier. There are several ways to determine the best features, and
part of the goal of my reserach is to determine the best way to choose these
features. I created a program, FeatureFinder (see below for details),
that uses a corpus of over 3000 emails (about 16% spam) and uses a variety of
feature selection methods to find the best features to build our
classifier. Below are the methods I experimented with:
Feature Vector Size - the number of features to use when training the classifier
Feature Vector Type - Boolean, TF, or TF-IDF
Stop Terms - words that can be ignored like "a", "as", "the", etc.
Word Stemming - removing suffixes - e.g. "building" and "builder" become "build"
Discretization - whether you classify with all the data or simplify it
Mutual Information Gain - calculating how much each feature helps classify a document
For more information on how all these work, see the formal report - there's a link below.
Classification
Once the best features have been selected, it's time to create our
classifier. There are many classifiers that work quite well. Here
are the two I experimented with:
Naive Bayesian Network - a simple classifier with suprisingly accurate classifications (weka.classifiers.NaiveBayes)
AdaBoost.M1 - a boosting algorithm that uses several Naive Bayesian classifiers (weka.classifiers.AdaBoostM1 .... -W weka.classifiers.NaiveBayes ...)
These are also described more fully in the formal report.
Code
Below are links to the code I created and to the Weka homepage.
Both the classifiers I used came from the Weka package.
FeatureFinder.java - Used to create .arff for Weka classifiers. You need to compile it first, though.
SpamDiagnostic.exe - Used to analyze diagnostic files from FeatureFinder.java. Let's you see which features were used in the feature vector.
Weka Home Page - The homepage for Weka - a powerful collection of classifiers and tools.
Documentation
Below are links to all the documentation for my research.
Included are the formal paper, the documentation for FeatureFinder and
SpamDiagnostic, as well as links to the data page where some of the data from
my research is available.
Formal Report - The formal report including experiment results. (MS Word .doc format)
FeatureFinder Documentation - Explains how to run FeatureFinder. (MS Word .doc format)
SpamDiagnostic Documentation - Explains how to run SpamDiagnostic. (MS Word .doc format)
The Data Page - Link to the Data Page where experimental data files can be found.