The Data Page
Below are links to a small portion of the data used in my experiments.
Ling-Spam
Ling-Spam is the corpus I used for all experiments.
In order to get FeatureFinder to run properly, you must have a
directory named "Ling-Spam" in the same directory as FeatureFinder.class.
Ling-Spam Corpus - The corpus used in all experiments. Be warned: it's big, and you'll have to unzip it and rename the resulting directory to "Ling-Spam" for use with FeatureFinder.
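If you want to confirm the directory layout before launching FeatureFinder, a small check along these lines may help. This is only a sketch and is not part of FeatureFinder: the class name CorpusCheck and the method corpusPresent are made up for illustration, and it assumes you run it from (or pass it) the directory that holds FeatureFinder.class.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of FeatureFinder itself.
class CorpusCheck {
    // Directory name that FeatureFinder is said to expect next to FeatureFinder.class.
    static final String CORPUS_DIR = "Ling-Spam";

    // Returns true if baseDir contains a directory named "Ling-Spam".
    static boolean corpusPresent(Path baseDir) {
        return Files.isDirectory(baseDir.resolve(CORPUS_DIR));
    }

    public static void main(String[] args) {
        // Assumption: the first argument (or the current directory) is where
        // FeatureFinder.class lives.
        Path baseDir = Paths.get(args.length > 0 ? args[0] : ".");
        if (corpusPresent(baseDir)) {
            System.out.println("Found " + baseDir.resolve(CORPUS_DIR));
        } else {
            System.out.println("Missing directory: " + baseDir.resolve(CORPUS_DIR)
                    + " (unzip the corpus and rename the folder to " + CORPUS_DIR + ")");
        }
    }
}
```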
FeatureFinder output - .arff Files
Here are some of the output files generated by FeatureFinder
during my experiments. Run a Weka classifier on these to test
classification accuracy.
Note that each of these files has a matching diagnostic file below.
Also, the name tells you what methods were used - e.g. TF_NoD_FV500.arff
uses TF feature vectors without discretization and a feature vector size of
500.
baseline.arff - The baseline file used in my research
FV20.arff
FV30.arff
FV50.arff
Stem_FV20.arff
TF_FV20.arff
TF_NoD_FV250.arff
TF_NoD_FV500.arff
TFIDF_FV20.arff
TFIDF_NoD_FV250.arff
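The naming scheme described above (for example TF_NoD_FV500.arff) can be decoded mechanically. The following is a rough sketch, not code from FeatureFinder: the class ArffName and its fields are hypothetical, and it assumes that a file is discretized unless its name contains "NoD", and that any token other than "NoD" or "FV<n>" (such as TF, TFIDF, Stem, or baseline) is a method label to be kept as-is. It also derives the name of the matching diagnostic file from the pattern visible in the listings on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for the .arff file names listed on this page.
class ArffName {
    final List<String> methods = new ArrayList<>(); // e.g. "TF", "TFIDF", "Stem", "baseline"
    boolean discretized = true; // assumption: discretization is on unless "NoD" appears
    int vectorSize = -1;        // -1 when the name has no FV<n> token (e.g. baseline.arff)

    // Splits the base name on "_" and classifies each token.
    static ArffName parse(String fileName) {
        ArffName result = new ArffName();
        for (String token : stripExtension(fileName).split("_")) {
            if (token.matches("FV\\d+")) {
                result.vectorSize = Integer.parseInt(token.substring(2));
            } else if (token.equals("NoD")) {
                result.discretized = false;
            } else {
                result.methods.add(token);
            }
        }
        return result;
    }

    // Builds the matching diagnostic file name, e.g. diagnostic_FV20.txt for FV20.arff.
    static String diagnosticName(String fileName) {
        return "diagnostic_" + stripExtension(fileName) + ".txt";
    }

    private static String stripExtension(String fileName) {
        return fileName.endsWith(".arff")
                ? fileName.substring(0, fileName.length() - ".arff".length())
                : fileName;
    }

    public static void main(String[] args) {
        ArffName n = parse("TF_NoD_FV500.arff");
        // Prints: [TF] discretized=false size=500
        System.out.println(n.methods + " discretized=" + n.discretized + " size=" + n.vectorSize);
        // Prints: diagnostic_TF_NoD_FV500.txt
        System.out.println(diagnosticName("TF_NoD_FV500.arff"));
    }
}
```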
Diagnostic Files - diagnostic_*.txt
Here are the diagnostic files for each of the .arff files listed above. They
contain, among other data, a list of the words used in the feature vector,
ordered by MI.
Save these by right-clicking and selecting "Save As...", then open them with
SpamDiagnostic.exe.
diagnostic_baseline.txt - The baseline diagnostic
diagnostic_FV20.txt
diagnostic_FV30.txt
diagnostic_FV50.txt
diagnostic_Stem_FV20.txt
diagnostic_TF_FV20.txt
diagnostic_TF_NoD_FV250.txt
diagnostic_TF_NoD_FV500.txt
diagnostic_TFIDF_FV20.txt
diagnostic_TFIDF_NoD_FV250.txt
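For readers wondering how an MI ranking like the one in the diagnostic files might be produced, here is an illustrative sketch. It assumes that MI means the mutual information between a word's presence in a message and the spam/legitimate label, computed from document counts with the natural logarithm. This page does not say how FeatureFinder actually computes its scores, so the class MutualInformation and its method are hypothetical and may differ from the real implementation.

```java
// Hypothetical illustration of mutual information between a binary
// "word present" feature and a binary spam/legitimate label.
class MutualInformation {
    // Arguments are document counts:
    //   termSpam    - spam messages containing the word
    //   termLegit   - legitimate messages containing the word
    //   noTermSpam  - spam messages without the word
    //   noTermLegit - legitimate messages without the word
    static double mi(long termSpam, long termLegit, long noTermSpam, long noTermLegit) {
        double n = termSpam + termLegit + noTermSpam + noTermLegit;
        if (n == 0) {
            return 0.0;
        }
        // joint[x][c]: x = 0/1 for word absent/present, c = 0/1 for legitimate/spam.
        double[][] joint = {
            {noTermLegit / n, noTermSpam / n},
            {termLegit / n, termSpam / n}
        };
        double sum = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int c = 0; c < 2; c++) {
                double pxc = joint[x][c];
                if (pxc == 0.0) {
                    continue; // empty cells contribute nothing to the sum
                }
                double px = joint[x][0] + joint[x][1]; // marginal over the label
                double pc = joint[0][c] + joint[1][c]; // marginal over word presence
                sum += pxc * Math.log(pxc / (px * pc));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // A word spread evenly across both classes carries no information.
        System.out.println(mi(10, 10, 10, 10));
        // A word that appears in every spam message and no legitimate one.
        System.out.println(mi(50, 0, 0, 50));
    }
}
```

Under these assumptions, sorting words by this score in descending order would put the most class-discriminating words first.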