ChrisMyden.com - Fighting The War Against SPAM

Fighting The War Against SPAM
May 05, 2003

You've probably noticed that your inbox is getting more and more SPAM each and every day. Recently I came across a plugin for Outlook that has reduced the amount of SPAM I receive to nearly zero. To date, it hasn't made any false identifications.

It is based on Bayesian analysis: the plugin scores each e-mail to determine whether or not it is junk.

If you would like to learn more and download a Bayesian plugin for your favorite e-mail client, read on; the filters I know of are listed further down.

How it works

Bayesian spam filters build their list of spam indicators themselves. Ideally, you start with a large collection of emails that you have classified as spam and another collection of good mail. The filter analyzes both to calculate the probability of various characteristics appearing in spam and in good mail.
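The training step described above can be sketched in a few lines of Python. This is only a minimal illustration, not how SpamBayes or any other real filter is implemented: the function names, the whitespace tokenizer, and the clamping of probabilities to the range 0.01 to 0.99 are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()

def train(spam_messages, good_messages):
    """Count, for every word, how many spam and how many good messages contain it."""
    spam_counts = Counter()
    good_counts = Counter()
    for message in spam_messages:
        spam_counts.update(set(tokenize(message)))  # count each word once per message
    for message in good_messages:
        good_counts.update(set(tokenize(message)))
    return spam_counts, good_counts, len(spam_messages), len(good_messages)

def spam_probability(word, spam_counts, good_counts, n_spam, n_good):
    """Estimate how strongly a word indicates spam, as a value between 0.01 and 0.99."""
    spam_freq = spam_counts[word] / max(n_spam, 1)
    good_freq = good_counts[word] / max(n_good, 1)
    if spam_freq + good_freq == 0:
        return 0.5  # assumption: a word never seen before is treated as neutral
    p = spam_freq / (spam_freq + good_freq)
    # assumption: clamp so that no single word is ever treated as absolute proof
    return min(0.99, max(0.01, p))

# Hypothetical training data, invented for this example
spam = ["cheap toner sale", "toner toner discount"]
good = ["cartesian coordinates lecture", "notes on cartesian products"]
spam_counts, good_counts, n_spam, n_good = train(spam, good)
toner_score = spam_probability("toner", spam_counts, good_counts, n_spam, n_good)
cartesian_score = spam_probability("cartesian", spam_counts, good_counts, n_spam, n_good)
```

With this tiny training set, "toner" ends up at the upper clamp and "cartesian" at the lower one. A real filter needs far more mail than this before its numbers mean anything.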

A Bayesian spam filter can look at the words in the body of the message, of course, and at its headers (senders and message paths, for example), but also at other aspects such as HTML code (like colors), or even word pairs and phrases.
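To make "characteristics" a little more concrete, here is a rough Python sketch that turns a message into a set of features: single words, word pairs, and header tokens. The function name, the "pair:" and header prefixes, and the regular expressions are assumptions chosen for illustration. HTML attributes such as colors are left out to keep the example short.

```python
import re

def extract_features(headers, body):
    """Return a set of features: body words, adjacent word pairs, and header tokens."""
    words = re.findall(r"[a-z0-9$'-]+", body.lower())
    features = set(words)
    # word pairs: each word together with the word that follows it
    features.update(f"pair:{first} {second}" for first, second in zip(words, words[1:]))
    # header tokens are prefixed with the header name so that "from:x" and a body word "x" stay distinct
    for name, value in headers.items():
        for token in re.findall(r"[a-z0-9.@-]+", value.lower()):
            features.add(f"{name.lower()}:{token}")
    return features

# Hypothetical message, invented for this example
features = extract_features({"From": "sales@example.com"}, "Cheap toner now")
```

Each feature can then be counted and scored exactly like a single word.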

If a word, "Cartesian" for example, never appears in spam but often in your legitimate mail, the probability of "Cartesian" indicating spam is near zero. "Toner", on the other hand, appears exclusively, and often, in spam.

"Toner" has a very high probability of being found in spam, not much below 1 (100%). When a new message arrives, it is analyzed by the Bayesian spam filter, and the probability of the complete message being spam is calculated using the individual characteristics. Let's say a message contains both "Cartesian" and "toner". From these words alone it's not yet clear whether we have spam or legit mail. But other characteristics will (most probably) indicate a probability that allows the filter to classify the message as either spam or good mail.

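The "Cartesian" and "toner" case can be worked through in Python. The sketch below uses one common way of combining per-word probabilities (the product of the spam probabilities weighed against the product of their complements). Whether a particular plugin uses exactly this formula is not something this article can confirm. The word scores and the 0.9 threshold are made-up values for the example.

```python
from math import prod

def combine(probabilities):
    """Combine per-characteristic spam probabilities into one score for the message.

    Assumes every probability lies strictly between 0 and 1.
    """
    spam_evidence = prod(probabilities)
    good_evidence = prod(1.0 - p for p in probabilities)
    return spam_evidence / (spam_evidence + good_evidence)

def is_spam(probabilities, threshold=0.9):
    """Classify as spam when the combined score reaches the (assumed) threshold."""
    return combine(probabilities) >= threshold

# Hypothetical word scores, invented for this example
cartesian, toner, free = 0.01, 0.99, 0.95

two_words = combine([cartesian, toner])          # the two words cancel each other out
three_words = combine([cartesian, toner, free])  # a third characteristic tips the balance
```

With only "Cartesian" and "toner" the combined score sits at about one half, which is the "not yet clear" situation from the text. A third characteristic moves the score firmly to one side.
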
Bayesian Spam Filters Can Adapt Automatically

Now that we have a classification, the message can be used to train the filter further. In this case, either the probability of "Cartesian" indicating good mail is lowered (if the message containing both "Cartesian" and "toner" is found to be spam), or the probability of "toner" indicating spam must be reconsidered.

This way Bayesian filters can learn from both their own and the user's decisions (if she manually corrects a misjudgment by the filter). Their adaptability also makes Bayesian filters most effective for the individual email user. While most people's spam may have similar characteristics, legitimate mail is characteristically different for everybody.
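The learning loop can be sketched as a small Python class that keeps running counts and lets the user undo a wrong decision. The class and method names are hypothetical, and this is a simplified illustration rather than the way any particular plugin stores its data.

```python
from collections import Counter

class AdaptiveFilter:
    """Keeps per-word message counts for "spam" and "good" mail and updates them over time."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.totals = {"spam": 0, "good": 0}

    def learn(self, text, label):
        """Add a classified message to the training data."""
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def unlearn(self, text, label):
        """Remove a message that was previously learned under the given label."""
        self.counts[label].subtract(set(text.lower().split()))
        self.totals[label] -= 1

    def correct(self, text, wrong_label, right_label):
        """Apply a manual correction: move the message from one class to the other."""
        self.unlearn(text, wrong_label)
        self.learn(text, right_label)

    def word_spam_probability(self, word):
        """Estimate how strongly a word indicates spam, given the current counts."""
        spam_freq = self.counts["spam"][word] / max(self.totals["spam"], 1)
        good_freq = self.counts["good"][word] / max(self.totals["good"], 1)
        if spam_freq + good_freq == 0:
            return 0.5  # assumption: unknown words are neutral
        return spam_freq / (spam_freq + good_freq)

# Hypothetical messages, invented for this example
mail_filter = AdaptiveFilter()
mail_filter.learn("cartesian geometry notes", "good")
mail_filter.learn("cheap toner", "spam")
before = mail_filter.word_spam_probability("cartesian")

# The filter judges a new message to be spam and learns from its own decision
mail_filter.learn("cartesian toner offer", "spam")
after_spam = mail_filter.word_spam_probability("cartesian")

# The user disagrees and marks the message as good mail
mail_filter.correct("cartesian toner offer", "spam", "good")
after_fix = mail_filter.word_spam_probability("cartesian")
```

Learning the message as spam raises the score of "cartesian" above zero. The manual correction brings it back down and lowers the score of "toner" instead.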

Where can I get a Bayesian spam filter?

SpamBayes Outlook Addin (the one I use, does not support Outlook Express)

Spamihilator (supports Outlook 2000/XP/Express, Eudora, Pegasus Mail, Phoenix Mail, Opera, Mozilla, Netscape, and others)

There are many others. Just search on Google for Bayesian Spam Software.

Cool, can I use this to filter my Hotmail mail?

I've noticed that Outlook does not let you use plugins with HTTP-based servers. To get around this, I installed Hotmail Popper 2.0, a free piece of software that lets you read your Hotmail in any of your favorite e-mail clients as though it were a POP-based account.