The Data Page

Below are links to a small portion of the data used in my experiments. 

Ling-Spam
Ling-Spam is the corpus I used for all experiments. 
In order to get FeatureFinder to run properly, you must have a directory named "Ling-Spam" in the same directory as FeatureFinder.class.

Ling-Spam Corpus - The corpus used in all experiments. Be warned: it's big, and you'll have to unzip it and rename the resulting directory to "Ling-Spam" for use with FeatureFinder.
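
If FeatureFinder fails to start, it may help to confirm the directory layout first. The following is a minimal sketch, not part of FeatureFinder: the class name CorpusCheck and the method corpusPresent are made up for illustration, and it assumes FeatureFinder.class sits in the directory you pass in (or the current directory).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of FeatureFinder.
final class CorpusCheck {
    // Directory name required by the text above.
    static final String CORPUS_DIR = "Ling-Spam";

    // True when a "Ling-Spam" directory sits directly inside classDir.
    static boolean corpusPresent(Path classDir) {
        return Files.isDirectory(classDir.resolve(CORPUS_DIR));
    }

    public static void main(String[] args) {
        // Assumption: FeatureFinder.class lives in the given directory,
        // or in the current directory when no argument is passed.
        Path classDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(corpusPresent(classDir)
                ? "Found " + CORPUS_DIR
                : "Missing " + CORPUS_DIR + " in " + classDir.toAbsolutePath());
    }
}
```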

FeatureFinder output - .arff Files
Here is a selection of the output files generated by FeatureFinder during my experiments. Run a Weka classifier on these to test classification accuracy.
Note that each of these files has a matching diagnostic file below. 
Also, the name tells you what methods were used - e.g. TF_NoD_FV500.arff uses TF feature vectors without discretization and a feature vector size of 500.
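
The naming scheme can be read mechanically. The sketch below splits a file name on underscores and picks out the tokens that appear in the list of files. Only the TF, NoD and FV meanings are stated above; treating "Stem" as a stemming flag and "TFIDF" as a second weighting scheme is an assumption based on the names alone, and the class ArffName is hypothetical.

```java
// Hypothetical helper that decodes names such as TF_NoD_FV500.arff.
final class ArffName {
    final String weighting;          // "TF" or "TFIDF"; empty when the name gives none
    final boolean noDiscretization;  // true when the name contains "NoD"
    final boolean stemmed;           // true when the name contains "Stem" (assumed meaning)
    final int vectorSize;            // number after "FV"; -1 when absent

    private ArffName(String weighting, boolean noDiscretization,
                     boolean stemmed, int vectorSize) {
        this.weighting = weighting;
        this.noDiscretization = noDiscretization;
        this.stemmed = stemmed;
        this.vectorSize = vectorSize;
    }

    static ArffName parse(String fileName) {
        // Drop the .arff extension if present.
        String base = fileName.endsWith(".arff")
                ? fileName.substring(0, fileName.length() - ".arff".length())
                : fileName;
        String weighting = "";
        boolean noD = false;
        boolean stem = false;
        int size = -1;
        for (String token : base.split("_")) {
            if (token.equals("TF") || token.equals("TFIDF")) {
                weighting = token;
            } else if (token.equals("NoD")) {
                noD = true;
            } else if (token.equals("Stem")) {
                stem = true;
            } else if (token.matches("FV\\d+")) {
                size = Integer.parseInt(token.substring(2));
            }
            // Any other token (e.g. "baseline") is ignored.
        }
        return new ArffName(weighting, noD, stem, size);
    }

    public static void main(String[] args) {
        ArffName n = parse("TF_NoD_FV500.arff");
        // prints "TF true 500"
        System.out.println(n.weighting + " " + n.noDiscretization + " " + n.vectorSize);
    }
}
```

A name with no recognized tokens, such as baseline.arff, comes back with an empty weighting and a vector size of -1, so the sketch says nothing about what settings the baseline used.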

baseline.arff - The baseline file used in my research
FV20.arff
FV30.arff
FV50.arff
Stem_FV20.arff
TF_FV20.arff
TF_NoD_FV250.arff
TF_NoD_FV500.arff
TFIDF_FV20.arff
TFIDF_NoD_FV250.arff
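
Before loading a downloaded file into Weka, you can check it against its name by counting the attributes declared in the ARFF header. This is only a sketch with a hypothetical class name (ArffHeader). It relies on the general ARFF layout, where "@attribute" lines precede "@data", and it does not know whether these files declare a class attribute in addition to the feature attributes, so expect the count to be near the FV number rather than necessarily equal to it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: counts @attribute declarations in an ARFF header.
final class ArffHeader {
    static int countAttributes(List<String> lines) {
        int count = 0;
        for (String raw : lines) {
            // ARFF keywords are case-insensitive.
            String line = raw.trim().toLowerCase(Locale.ROOT);
            if (line.startsWith("@data")) {
                break; // header ends where the data section begins
            }
            if (line.startsWith("@attribute")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ArffHeader.java FV20.arff
        // ISO-8859-1 is used so that any byte sequence decodes without error.
        List<String> lines = Files.readAllLines(
                Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        System.out.println(countAttributes(lines) + " attributes declared");
    }
}
```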

Diagnostic Files - diagnostic_*.txt
Here are the diagnostic files for each of the .arff files listed above. They contain, amongst other data, a list of the words used in the feature vector, ordered by MI.
Save these by right-clicking and selecting "Save As...", then open them with SpamDiagnostic.exe.

diagnostic_baseline.txt - The baseline diagnostic
diagnostic_FV20.txt
diagnostic_FV30.txt
diagnostic_FV50.txt
diagnostic_Stem_FV20.txt
diagnostic_TF_FV20.txt
diagnostic_TF_NoD_FV250.txt
diagnostic_TF_NoD_FV500.txt
diagnostic_TFIDF_FV20.txt
diagnostic_TFIDF_NoD_FV250.txt
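
The pairing between the two lists above follows one pattern: prefix the .arff base name with "diagnostic_" and change the extension to .txt. A small sketch of that mapping, useful for checking that every downloaded .arff has its diagnostic alongside it, might look like this. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that pairs .arff files with their diagnostic files.
final class DiagnosticNames {
    // "FV20.arff" becomes "diagnostic_FV20.txt".
    static String diagnosticFor(String arffName) {
        if (!arffName.endsWith(".arff")) {
            throw new IllegalArgumentException("not an .arff file name: " + arffName);
        }
        String base = arffName.substring(0, arffName.length() - ".arff".length());
        return "diagnostic_" + base + ".txt";
    }

    // Returns the .arff names whose diagnostic file is not in presentFiles.
    static List<String> missingDiagnostics(List<String> arffNames, Set<String> presentFiles) {
        List<String> missing = new ArrayList<>();
        for (String arff : arffNames) {
            if (!presentFiles.contains(diagnosticFor(arff))) {
                missing.add(arff);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> arffs = Arrays.asList("baseline.arff", "FV20.arff");
        Set<String> present = new HashSet<>(Arrays.asList("diagnostic_baseline.txt"));
        // prints "[FV20.arff]"
        System.out.println(missingDiagnostics(arffs, present));
    }
}
```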