EEL6825 Final Project
Bayes Filter for Spam Email
Louis Brandy

 


A Small Introduction


This webpage is meant to be the final report for my EEL6825 Pattern Recognition Class. The goal of this project was to classify emails as either 'spam' or 'not spam'. The actual concepts are rather simple for almost anyone to understand give a very basic knowledge of probability. I apologize in advance for my inability to make a decent webpage, as I hope the content makes up for the aesthetics. To the grader of this report, I intended this to have a wider audience, so some of the probability discussion is heavier on the explanation then required.


Table of Contents...
1. Problem Description
2. Approach Theory
3. Probability Equations
4. Explanation of Code
5. Results / Examples
6. Areas for Improvement
7. Appendices


Problem Description


The goal of this project was very simple. I wanted to classify incoming emails as either spam or not spam. The technique used to do this was a (naive) Bayesian classifier which took training emails, pre-classified by hand, and attempted to learn what was and was not spam. Given an adequate training sample size, the classifier should, with a high rate of accuracy, be able to determine which incoming emails were and were not spam.

On the point of accuracy, it should be noted that in any email classification system not all misclassifications are created equal. A "false-positive" is when a non-spam email is classified as spam and is discarded. This is far more heinous then allowing the occasional spam email to make it to an inbox. Any filter should not, then, be aiming for the minimum error in terms of overall misclassification, but instead for the minimum error with an acceptable level of false-positives.

An email client could have two buttons, instead of one, for deleting email. One would be a delete-as-spam, which would add that emails contents to the data that filters out future email. The other button would simply just erase the email. The email client, then, would be able, at least, to flag incoming emails as whether it thinks it’s spam or not. Clients are just now beginning to come out with this exact feature (with, given my results, took surprisingly long). Several email clients are coming out with a ‘scoring’ system based entirely around the concept of computing a Bayes probability that the email is in fact spam.

The article that I read that both inspired this project, and was referred to a good deal any time I ran into difficulties or had trouble is found here: http://www.paulgraham.com/spam.html. This article is an extremely interesting read for anyone interested in either getting rid of spam, or Bayesian classification ideas. I attempted several different things then those specified in his article though our results are both strikingly similar. I will mention any major deviations from the algorithm he specifies in the text.

Return to Top


Approach Theory


WORDCOUNT

I should start by saying that I don’t erase the emails from my computer, so I had a large set of data to begin this work. I chose a set of about 650 spam emails, and about 300 innocent emails as training data. I considered the following fields:

From: {"Name","address","type"}
To:{"name,"address","type"}
{Subject}
{Body}

Therefore, the variety of other fields that can occur in an email are ignored, only who it's from, the name they use in their email, who they send it to (always lbrandy@ufl.edu), and what name they refer to me by. Also, obviously subject and body of the email are included, as well as the type. I chose to strip all html from messages before counting words for two reasons. First, almost all spammers use html mail, so normal email in html was as good as guilty. Stripping html, at worst, could cause spam to make it to my mailbox, and leaving the html could cause real email to get trashed as spam. As is always the case when decisions like this need to be made, I always chose to favor spam making it to my inbox. Secondly, spam emails like to insert HTML comments between words they know people are looking for, like sex or free. Two comments need to be made at this point, one, Microsoft Outlook is the program I used for all my emails, and I used it's export feature to remove html, therefore I do not have code to do this. Secondly, this is the first major difference between my technique and the one detailed by Paul Graham.

The basic idea was to go through and count the occurrence of every word and then keep a table of how many times each word appears in spam email and how often a word appears in real email. The first difficulty is defining what exactly is a word? I chose a very simple definition. A word is any sequence of numbers and letters less then 15 in length. Anything that was not a letter or number is considered a separator and anything over length 15 is equal to only it’s first 15 letters. Two words longer then 15 letters whose first 15 letters are the same, are treated as equal. Finally, all letters are made lowercase. For example,

Opt-in becomes opt / in
opt in becomes opt / in
lbrandy@ufl.edu becomes lbrandy / ufl / edu
ThisIsAVeryLongWordYesYes becomes thisisaverylong
ThisIsAVeryLongWordToo becomes thisisaverylong

Once how everything would be broken into words was decided, counting up the words and storing this information into a database becomes a simple conceptual problem. Just loop through the entire length of the emails, incrementing the values as each word occurs, and adding it to the list if it's not already there. Actually implementing this in software is a bit more cumbersome to do on a scale this large, and that is covered in the code section. Once this was complete, the word-lookup table was done. Given any word, you could lookup what the frequencies of a word were in the training data. This data, in and of itself, was terribly interesting.

A noteworthy example:

Word

Spam

Good

louis

194

73

brandy

0

48

As it turns out, my first name appears more in Spam mail then it does in my real mail! However, if you use my last name, then you are virtually guaranteed to be a real sender of mail. If you sent me an email containing only my first name, "Louis", and nothing else (as a simple example, since any email must have a subject/body/to/from as well), this classifier would give it a 57% chance of being spam. Where that number 57% comes from, in relation to the above numbers is covered in the Probability section. Suffice to say that it's not as simple as dividing the spam-count by the total-count. This assumes that both spam-emails and actual-emails were of equal number in the training data. (And this presents the further difficulty of whether to use the total number of emails, or the total number of words.) Since more email for spam were used in training, words that appear equally often in both would have a higher number for the spam-count.

CLASSIFIER

Now that we have a database of word counts, we can easily create a classifier that is given an email, it looks up the probability of every word in the email. This number (ie, the 57% from above) says that "give only this word, the chances this mail is spam is...". It then "combines" the probability of all the words in an email to get the final result. Using all of the words is another major difference from my classifier to Grahams. He uses only the 15 most "interesting" where "interesting" is the farthest from the center. I.e., .01 and .99 are equally as interesting, and both are of very high interest. Using all of the words provides a different set of computational challenges as I'll explain in the Probability section.

Return to Top


Probability Equations


The entire purpose of this section is to describe how we get from a set of word-counts to a final probability for a collection of words. The very first thing that must be done is the word-counts must be "balanced". There is about 900k worth of spam email, and about 600k worth of real email in the training data. This means any word that actually appears equally often in either (ie, "is") would show up in a 9/6 ratio in the word-counts. In order to balance this effect, it would make sense to multiply the good word-counts by 9/6. This equation, therefore, is specific to my training-data size.

It follows, then,

 
                          bad-count
p(spam|word) = -----------------------------
                 bad-count + (3/2)*good-count
 

This is, in fact, the actual equation. However, it's not the one I used. Remember I said I wished to bias the system again false positives, in order to do this, all you have to do is multiply the good-numbers by a higher number. Instead of 3/2 (or 1.5), I chose to up it to 2. I have not experimented with this value at all, it's there solely for protection against false positives. Tweaking this number would be an interesting idea for future work. This new multiplier will cause all words to tend to the side of being harmless. The hardcore words (like says, spam: 5000, good:15) will barely be effected (spam:5000, good:30); The actual equation used, then,

 
                          bad-count
p(spam|word) = -----------------------------
                 bad-count + 2*good-count
 

It should be noted that this biasing idea comes from Graham. However, he uses a much higher bias then I do. His equations balance the good and bad count and then multiply the goods by 2. Whereas mine only multiplies the balanced word-counts by 4/3, or 1.333 (ie, the (3/2) is what balances it, so (3/2)*(4/3) = 2 we see in the denominator). If I had used his biasing standard, my true word-count would be multiplied by 3, not by 2. This biasing procedure, also, make p(spam|word) no longer, technically, a true probability of the training set in the sense that we have added data that's not in the original set. However, it is still a probability of a new training set, one in which the good emails are effectively counted extra (1.333 extra, to be exact, or for every 3 occurrences of a word, we count 4).

The question, then, is what does this number actually represent? It represents the probability that, in a particular email, you see this word, then the chances that this email is spam is p(spam|word). Each word in an email has a p(spam|word) and thus running this through an email will result in a long list of probabilities that must be combined.

Combining the list of probabilities is quite simple mathematically, but deceptively difficult to do in on a computer for large numbers of values to be combined. The actual formula looks like this (where a,b,c...,n are probabilities of all the words):

 
 
                                      a*b*c...*n      
p(spam|all-words) =     ----------------------------------------
                       a*b*c...*n + (1-a)*(1-b)*(1-c)...*(1-n)
 

Graham uses the 15 most interesting words, as described before, to calculate the probability something is spam, given those 15 words. This makes the computation rather simple on a computer. The difficulty comes when you try to do this calculation on all the words (potentially 1000s). Since you don't know how long the email is, you don't want to store the list first, and then calculate later. It would be far easier to keep a running "total". Combine the new probability with the total, and load up a next. When no more are left, return the total. The simplest way to do this (which does NOT WORK WELL for large values of n) is this:

 
(Pseudocode)
joint = 1;                     //Initialize
conj = 1;
Start Reading the File
 
#HERE#
GetNextWordProbability
joint = joint*new-prob
conj = conj*(1-new-prob)
GOTO HERE until no more words
 
p(spam|all-words) = joint / (join + conj)

I had initially used this technique, which works absolutely fine for small amounts of words. The problem is underflow when dealing with 100s of words. Since all probabilities are under one, then both joint and conj are both shooting towards 0, and the more numbers we multiply, the more problems we are going to have with precision and underflow. Keeping these numbers in a running total is not effective and causes some unfortunate precision errors (like divide by 0).

Instead, what I needed was a running probability total (or some representative of it). The probability goes to 0 only if the resulting email is not spam, so an underflow means that the word is almost certainly not-spam. However, it's not really possible, given the combined probability of N-1 things and the probability of N, to combine them directly. Instead, the following manipulations were done to make this computationally more sound:

 
 
Just notation:
a..n => a*b*c*...*n
(1-a)..(1-n) => (1-a)(1-b)(1-c)...(1-n)
 
           a..n
P = -------------------
    a..n + (1-a)..(1-n)
 
 
Therefore,
                (1-a)..(1-n)
1/p = 1   +  ----------------
                   a...n
 
 
Further,
1/p - 1 = (1-a)..(1-n)
          ------------
              a..n
 

This creates means that {(1/p) - 1} can be represented by a running total like this:

 
(Pseudocode)
total = 1;                     //The running total
Start Reading the File
 
#HERE#
GetNextWordProbability
total = total * (1-newprob)
total = total / newprob
GOTO HERE until no more words
 
p(spam|all-words) = 1/(total + 1)

In this example, total is equal to the 1/p plus 1. This means, where, p varies from 1 to 0, total will vary from 0 to infinity (well, to the maximums of double-precision floating point). High numbers mean the email is very likely innocent, and low numbers of total mean the email is very likely spam. Contrast this with the original approach, where "conj" and "joint" both went to zero regardless of the e-mail’s nature. This also means underflows and overflows (which didn't occur in any of my training set) is essentially reasonable evidence of the nature of the associated file. Such events could also force that particular email to pass the filter, regardless, just to ensure for safety. Also, total is initialized to 1 because 1/(total + 1) is equal to .5, in other words, whenever total is equal to 1, whatever has happened before is no longer relevant because it has all essentially canceled itself out.

Return to Top


Code Description


Unfortunately, the code is the weakest part of the project. Essentially the code, was put together to do my specific required tasks and is extremely far from being clean, efficient, or portable. Given more time, I'd love to clean up the code and make it far more portable and efficient. I will try to explain, here, what data-structures and organization is used in the code and how it's set up. Each of the 5 c files will be explained to purpose and it's use. All 5 c files are completely independent even though they certainly share quite a few functions. The first part of reorganizing this code would be to create a header-file library of the functions used in more then one program.

featureextract.c
Purpose: Take a single file as input. Tokenize input. Increment correct table values for each word.

This program has a single hard-coded string for the file-input name. Outlook exports entire folders of email to one-folder. This means my SPAM-TRAIN folder and REAL-TRAIN folder both got exported to two separate files. This program, then, was run twice, being recompiled with the new file-name for each. Also, you need to manually set which field (good or spam) is being incremented on recompiles to switch types.

The tokenizing process is actually extremely simple. First, and foremost, tokens are always 15 characters long and are always initialized to all spaces. The basic function required is a filter() function which does the following:

 
 
char filter(char c)
{
 
        /* filter() takes any char and returns only numbers and letters,  */
        /* all other ASCII characters are replaced by space. All letters  */
        /* are lowercased */
 
        if((c > 64) && (c < 91))
               c = c + 32;
        if(((c > 47) && (c < 58)) || ((c < 123) && (c > 96)))
        {
               return c;
        }
        else
            {
               return 32;
        }
}

Essentially, filter() makes upper-case letters lowercase, it then returns all numbers and lower-case letters. If any other character was input, it returns a space. This allows the tokenizer to simply clear a token variable, read character at a time, filter it, as soon as it gets something not a space, it begins copying it into the token until it gets a space (or it gets to the maximum of 15). The token is filled via a pointer passed in, so the calling program gives this function the string to be filled. Note, the file pointer is passed into this function, so the position in the file is maintained from call to call of this function. It returns a 0 if it has reached the end of the file.

Once a tokenizer is returning a fixed-length string, all that has to happen is the correct value of that token in some data-structure needs to be updated. Since the purpose of this data-structure is to lookup words later, a hash-table or a tree is likely the smart data-structure. I chose to use a hash-table since a file system gives you an easy way to hash out various values by using files. A token belongs in a file named by it's first letter, and length. So "s12.txt" contains the information for all 12 letter s-words. This separates a large portion of words. There is no ordering inside these files, so searching is sequential. The hash-table setup and searching functions could definitely use a revamp for efficiency. I chose this method because it was very simple to code as the data-structure is already in place, and the searching algorithm is trivial. To extract features from near 1.5 megabytes of emails took about 45 minutes using this code. On the other hand, classifying about .5 megabytes of email took only about 3 minutes.

The process is encapsulated in an update() function which basically takes the token and the source-file for that token (note, it could easily generate the source-file name from the token in my setup, since it's the first-letter of the token, followed by the length of the token. However, I left this in so that update() could be changed do a different scheme without changing the declarations or usage). Update then looks in source-file for token, updates the correct field (hard-coded, each compilation sets this based on which class is being used as a source file), or creates the token's field with hard-coded initial values if it's not in the file already. This encapsulation was done intentionally so that the update procedure could be updated easier for a different organization.

emailsplit.c
Purpose:
Split Microsoft outlook export data into individual files

This file is very application specific but is included for those who may need to split Microsoft-outlook exports into individual files. Essentially, outlook exports " denoted, and comma separated files. This program counts through the fields I used (8 fields, total) writes those 8 to a file, and then reads the next 8, writes that to a new file, etc. The files are numbered sequentially.

For training purposes, it really didn't matter where one email ended, and where one began, but for classification it does. Therefore, all classification data is split before it is classified.

classify.c
Purpose: Loop through a set of files, find the probability of spam of each file, output file.dat for each, output overall.txt

First, and foremost, this file loops through a directory, starting at '1.txt' classifying each file going in numerical order until it fails to open one. That's when it's finished. For each file, it outputs a '1.dat' for example. This file is also a text file in which each token is on it's own line with the probability for that word, and that word's word counts. At the end of the dat file is the overall combined probability give by the formula in the Probability section.

Examples of the original email is here with it's corresponding dat file . (note: the actual output is a .dat file, which is a common extension, so for the sake of this report, I've renamed all the .dat's to .txt to make it easier to display.)

Finally, this program also puts in the directory it was looping through a file called output.txt which shows the final probability of everything it attempted to classify. Since I separated my spam and my non-spam into different directories, each run of this program should have produced all spam or all not-spam probabilities. An example of an output.txt is here.

htmlmaker.c
PURPOSE: Takes an email as input, and outputs that email tokenized and colorized according to each word's probability.
combine.c
PURPOSE: Combines all the hash data into one file called all.txt.

Both of these files serve no purpose in actually classifying the data, but they both are useful in showing the intermediate results. The combine program puts all of the hash table information into one file to make it easy to import into Excel for various outputs. Some of the tables in the results section were created in Excel using this file.

The HTML-maker program colorizes each word of an email according to its probability of being spam. This is purely for demonstrational purposes, and will be shown in it's full-effect in the Results section.

Return to Top


Results - Examples


First, this section will show the various results and intermediate step through the classifier. These links show the spam datasets (I didn't include the actual emails of my real email for the sake of privacy. I will show examples, and the hash table, however):
Spam-Training set
Spam-Test set

Running that training data, to produce a hash-table, produces fairly intriguing results even before the data is used to classify email. For instance, I used excel to compile the list of the most frequent words that appear in 90% or more in either type of email. I present the following lists of the spammiest words in the email I receive, as well as the most innocent words I receive:
Top 30 innocent words
Top 30 spam words

There are several noteworthy entries on those lists. Perhaps the most interesting on the innocent list was things like hotmail, yahoo, msn. Clearly, emails coming from the common email providers are usually alright. This approach, then, essentially encapsulates and extends the concept of having a 'white-list' of email addresses to allow to send you email. The spammiest words are absolutely no surprise. It shouldn't surprise anyone that "Free offers: click here!" is the spammiest sentence you can send anyone. However, this process works far better then using manually created blacklists, since it finds many words you wouldn't necessarily think to. Perhaps the most intriguing is the token "0" which would include things like "0%" and "0 down".

Classifier
The final results are summarized below:

Data set

Link

Total Emails

Misclassified

Percentage

Real Test Data

output-file

71

1

98.6%

Spam Test Data

output-file

302

2

99.3%


The final classification is 99.2%. Below is a few emails which were classified well, as well as the 3 misclassified emails. Including in the links are the html-ized emails which show each word colored according to it's spamminess. There are 5 classes of words. Red indicates a spammy word, white a "intermediate" word, and blue is an innocent word. Words in bigger fonts are on the extreme end (above 95% for one side or the other). These html files were created using the htmlmaker.c file. Included, below, there is an example of a spam-mail that was almost entirely in html (spam10) that almost all of the email was removed when the html was stripped (mostly pictures). It was still classified correctly.

Email

Type

Classified-As

Output-file

Colorized email

Good10

GOOD

GOOD

Output

html

Good18

GOOD

GOOD

Output

html

Good30

GOOD

GOOD

Output

html

Spam3

SPAM

SPAM

Output

html

Spam10

SPAM

SPAM

Output

html

Spam200

SPAM

SPAM

Output

html

*Good50

GOOD

SPAM

Output

html

*Spam232

SPAM

GOOD

Output

html

*Spam66

SPAM

GOOD

Output

html


Good50 was the lone false-positive in my dataset. This was, partly, my fault. While this email was a solicited email, it is clearly a mass-mailing type of letter. The reason it probably is my fault is because I'm in several fantasy football leagues, and several of the websites (including the nfl.com one) use my email address from signing up to send me email about their websites. For this reason, there is quite a bit of fantasy-football related spam mail in the training folder. I trained my classifier to discard this exact type of email, and it did. This email is also full of "sell jobs" which doesn't help it's probabilities either. This email could really be classified either way, depending on your definition. Looking at the html of this email, you'll notice that there are very few words for innocent or guilt. This is an example of where limiting the number of words (ie, the most interesting) would have an effect. There are more innocent words on the extreme side, but many of the "in-between" words fall to the spam side, so just the sheer number of them overwhelm the more extremely innocent words. In a limited word model, this email probably would have come up innocent.

Spam232 is a strange type of error that I expected. When I was a freshman, I joined the NSCS, and have been receiving their emails for a long time. I have not been a member for several years. I included the NSCS mailings as spam (in the training data) almost entirely to truly test the spammer. In all honestly, it's not fair to ask such a system to classify these types of mails since they are solicited, therefore likely contain many words that would normally appear in your email. This particular one just happened to contain several words that are quite appropriate for a real email. This is another case of almost every uncolorized word falling to the side of incorrect classification. If there was a limit to the number of words included (the most interesting ones), then this would definitely be classified correctly.

Spam66 was the last email misclassified. This email, to be quite honest, was mistakenly misclassified by me. After actually reading this email after the results came back, I decided that this is certainly a borderline case of whether it's solicited or not, but you cannot expect the classifier to throw out an email from ufl, about scholarships and graduate school. I'm more inclined to chalk this error up to human error, as the computer did a better job then I did deciding if this email was pertinent reading for me.


Return to Top


Future Work


There are quite a few ways that this classifier could be extended or improved. However, with a classification rate over 99%, with the few misclassifications to be obviously borderline, I'm not sure the model for dissecting and classifying emails needs to be changed at all. There are obviously several ways to change the fundamental model including html, dissecting words to find two paired up without a space, looking at word-pairs like 'click here' and 'free offer'. The few misclassifications could possibly be corrected by limiting the amount of words used in the final probability computation, but then non-borderline cases would have a higher chance of being misclassified. I believe this approach to be alot safer.

The actual implementation of this model could use quite a bit of work. The code is not very clean, nor portable. I wrote that code to do useful things for this project, not to be easy to be used by others. Further, the hash-table data structure is extremely crude and inefficient. With a better fundamental data-structure, this code could be made to run far more efficiently for a real world application. However, even with this crude approach, adding a single emails contents to this table doesn't take any noticeable amount of time.

Overall, I was amazed at the final results, even though the article that I read hinted that the results were alarmingly easy. I am extremely surprised that I could sit down and, in less then a week, have a filter that can remove over 99% of spam emails with very few false-positives. Furthermore, I'm amazed that something so simple hasn't actually made it's way into major email-clients sooner. I imagine, extremely soon, email-clients will have two delete buttons, one which adds that e-mail’s contents to the "good" side, and one that adds it to the "bad" side. The email client will probably have an option to flag emails, or just flat-out block emails, that it deems spam.

Return to Top


Appendices


1. Spam-Training set (975k)
2. Spam-Test set (444k)
3. Combined Hash Table (240k)
4. Entire results from Spam-Test set (zip file, 845k)

Again, the real-training, real-test, and real-results are not included for the sole reason of keeping all of email off the internet.


Return to Top