Justifying Statistical Filtering

(and Open Source Technology)

Jonathan A. Zdziarski
jonathan@nuclearelephant.com

June 2005

Two types of people are likely to read this: systems administrators and executive managers. Most systems administrators are smart fellows. They tend to have a solid understanding of statistical filtering and its value, and are usually the last people you have to evangelize to about its benefits. Executive management can go either way. You might be a smart manager trying to put together a proposal and case study for deploying a form of statistical filtering (Bayesian, Markovian, etc.), or you may be silently Googling around after feeling really stupid about the solution you've implemented (against everybody's recommendation). In either case, this article is designed to provide a wake-up call to anyone running a commercial solution (or even thinking about it), and will hopefully explain why statistical solutions are a wiser decision.

I don't cover a whole lot of technical detail here, but I do think it's necessary to at least define a statistical solution in contrast to the heuristic last-gen solutions. Statistical solutions represent the latest generation of spam filter. They have some inherent AI (Artificial Intelligence) qualities that make them particularly good at what they do. This family of filters includes the now-popular Bayesian filters (pronounced "bay zee in") as well as other filters using statistical analysis to filter spam (such as Markovian classifier CRM114 and Chi-Square Bogofilter). In a nutshell, statistical filters are very different from other filters because they actually read email (well, sort of)...

Statistical Filters in a Nutshell

When a statistical filter receives an email, it first breaks the message down into tiny little pieces (called tokens). These can consist of words, word pairs, short phrases, or even just a letter or two. How it does this is really up to the filter author. Once the message is broken up into tokens, the disposition of each token is examined against the mail the filter has seen before. If the word 'Viagra' has historically shown up in spam most of the time, then that token will stick out as a guilty marker. Likewise, if the word 'eBay' has been seen primarily in legitimate messages, it will become a good marker of innocent mail (ham). Each marker carries a specific "probability" that determines its final value. For example, if the word 'pizza' has shown up in spam exactly as often as in legitimate mail, then a message containing 'pizza' has, on that evidence alone, a 50% chance of being spam (or not spam). The filter crunches these numbers and determines whether or not it believes the message is spam by calculating an overall statistical probability. For example, if the message has a 92% likelihood of being spam, then a Bayesian filter will indeed mark such a message as spam.
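
For the curious, the walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any particular filter: the simple word tokenizer, the made-up message counts, the clamping to 0.01-0.99, and the naive Bayes style combination rule are all choices made here for brevity.

```python
import re
from math import prod

def tokenize(text):
    """Split a message into lowercase word tokens (one of many possible schemes)."""
    return re.findall(r"[a-z0-9$']+", text.lower())

def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly one token points to spam, based on past mail."""
    s = spam_counts.get(token, 0) / n_spam   # share of past spam containing it
    h = ham_counts.get(token, 0) / n_ham     # share of past ham containing it
    if s + h == 0:
        return 0.5                           # never seen before: neutral marker
    # Clamp so that no single token can be treated as absolute proof.
    return min(0.99, max(0.01, s / (s + h)))

def message_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine the per-token probabilities into one spam probability."""
    probs = [token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in set(tokenize(text))]
    spam_side = prod(probs)
    ham_side = prod(1 - p for p in probs)
    return spam_side / (spam_side + ham_side)

# Hypothetical history: 100 spam and 100 legitimate messages seen so far.
spam_counts = {"viagra": 90, "pizza": 10, "ebay": 2}
ham_counts = {"viagra": 1, "pizza": 10, "ebay": 40}

pizza = token_probability("pizza", spam_counts, ham_counts, 100, 100)
score = message_probability("Cheap Viagra and pizza", spam_counts, ham_counts, 100, 100)
print(f"pizza marker: {pizza:.0%}, message: {score:.0%} likely spam")
```

With these invented counts, 'pizza' comes out as a neutral marker while 'viagra' dominates the message score. Real filters differ mainly in how they tokenize and in which tokens they allow into the final calculation.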

92% is a lot easier to understand for most of us than a score of 3.72. Most people haven't really got any idea what 3.72 means, including the filter authors. So not only do statistical filters rely primarily on mathematics to filter spam, but they also speak the same language most of us do.

Technical Excellence

This class of filters is commonly hailed for its extremely high levels of accuracy. Why are they so accurate, you ask? Well, for one, they analyze each user's email individually, so they're able to adapt to whatever email behavior the user specifically exhibits. If the user is into online shopping, the filter will be less likely to make the mistake of marking some of their legitimate shopping mail as spam. Another reason they're so accurate is that they learn and adapt very quickly when they make a mistake, and can conform themselves to users like a glove. If the filter makes a mistake, it begins (with temperament) to adjust its internal clockwork so that it won't make the same mistake in the future (this is what makes the filter learn).
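
A rough sketch of what "learning from a mistake" can look like, assuming the filter keeps simple per-user token counts. The class name, its methods, and the rule of shifting one count per correction are hypothetical and chosen for illustration; actual filters use their own training strategies.

```python
from collections import Counter

class UserFilterHistory:
    """Per-user token counts (a hypothetical structure for illustration)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def spam_share(self, token):
        """Fraction of this user's sightings of the token that were spam."""
        total = self.spam[token] + self.ham[token]
        return 0.5 if total == 0 else self.spam[token] / total

    def correct(self, tokens, should_be_spam):
        """The user reports a mistake: shift the evidence one step, not all at once."""
        right, wrong = (self.spam, self.ham) if should_be_spam else (self.ham, self.spam)
        for token in set(tokens):
            right[token] += 1
            if wrong[token] > 0:
                wrong[token] -= 1   # undo one count of the earlier wrong learning

history = UserFilterHistory()
history.ham.update({"sale": 3, "coupon": 2})    # this user shops online a lot
before = history.spam_share("coupon")
history.correct(["coupon", "pills"], should_be_spam=True)
after = history.spam_share("coupon")
print(f"'coupon' spam share moved from {before:.2f} to {after:.2f}")
```

Because each correction moves the counts by a single step, one odd message nudges the user's filter rather than overturning everything it has learned about that user's mail.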

In contrast, the "other" type of well known filters are called heuristic filters. Instead of making their own decisions about what spam is, they rely on a programmer to write a set of detection rules. Just like most popular virus scanners today, their "spam definitions" come to be out of date very quickly, as spam evolves. Because heuristic filters have no learning mechanism, they rely on frequent updates. Another reason heuristic filters are so terrible at what they do is because these rules are written for the entire world to use - and the entire world gets a whole lot of different email. If your email doesn't fit into what the programmer considers "the norm", you're likely to have a bunch of mail erroneously marked as spam.

SpamAssassin, for example, was (and still is for the most part) one of these types of filters; however, its developers have recently added a statistical "component" to the filter. So now it's not a heuristic filter and it's not a statistical filter - it's more like a gas/electric hybrid. If you're one of the 27 people who drive a gas/electric hybrid, you probably realize it's not particularly powerful. This also represents how most appliances are built today - several layers of heuristic tests with a Bayesian element stuck in at the bottom. Hybrid filters don't seem to be quite as powerful either, as they're more of a hodgepodge of tools thrown together than any type of technologically meaningful solution. In fact, due to what I call heuristic programming, any statistical components in the solution can end up acting dumbed down as a result of being told what ham and spam are by less accurate (namely heuristic) parts of the filter. It seems rather asinine to use the less accurate portion of the filter to train the more accurate portion, but that's how most hybrid filters are concocted today. Statistical filters will learn whatever you tell them to, so naturally when they're trained by something dumber than a human, they're going to react dumber than a human. I've talked to many people today using commercial appliances based on either SpamAssassin or some other hybrid model, and it's quite scary to hear that they're still deleting spam out of their inbox by the dozens.

Statistical filtering has now been mainstream for about three years, but despite its technical excellence, most appliance manufacturers are outfitting their boxes with the older style filters and even though the box is technically "new", this old technology is winding up on many networks. We'll get into that shortly.


Open-Source Roots

One of the noteworthy mentions about spam filtering in general, and especially true of statistical filtering is that open source tools really have obtained the upper hand in this venture. Statistical filtering is a technology developed by the open-source community and copied by the commercial industry, which is quite the stumbling block to companies like Microsoft, who have frequently positioned the open-source community to appear as a group of pirates (aargh!) who carbon-copy technology. Not only have our beloved open source solutions proven to provide some of the best results in the industry, but they're also free.

In spite of this, the Internet is having a huge junk sale on anti-spam products. There are many corporations pushing anti-spam solutions, some that are even claiming to have some creative ownership in spam filtering technology. In fact, hundreds of millions of dollars are being spent every year to purchase appliances that deliver a tenth of the results that the open source community is giving away for free. Some say this is due to support, or the need for rapid deployment, but in reality it's because the majority of commercial appliance manufacturers find statistical filtering so highly accurate that it can do its job without them. Regardless of whether it's Bayesian, Markovian, Chi-Square, or whatever - manufacturers have had to face the decision of either losing money (on nightly energizer updates and the like) or crippling their own filtering solution to require such constant maintenance that consumers will subscribe to all possible services for fear of their filter degenerating - which it will.

Much to the chagrin of the filter manufacturers, all reasonably written open source statistical filters actually get better with time (and on their own), like a fine wine. Imagine a world where there are no rule sets to update, no whitelists to maintain, and only minor tweaking by a sysadmin occasionally to blow the dust out of the fans. You've just imagined the next generation spam filter. In fact, many of the tests out there showing statistical filtering as superior don't even know the half of it - there's just nothing like a nice seasoned database that's been learning for a year or more. And sadly, this well oiled machine just doesn't fit into the monthly recurring business models of most manufacturers. The best you can hope for in many commercial solutions is a Bayesian "element". This really is more of marketing buzz than anything, however, as you'll find it buried deep below several more "heuristic" layers of analysis - all of which dumb down the true learning potential of any statistical elements in the box.

The sad truth of the matter is that most people have a knee-jerk reaction to spend money in order to own a pretty box. Depending on the color scheme of the server room, you've got the choice of aqua blue, earth-tone green, or luscious yellow. It sure looks good bolted into that rack in the server room, and the fact that it cost $50,000 gives CEOs bragging rights with equally vendor-conditioned customers. Much like other well-marketed solutions, many of the appliances out there deliver in the board room better than they deliver in the server room. I initially thought that after the first dot-bomb, corporate America would begin to wake up and see through all the marketing glitz poured over what are, for the most part, substandard products. In spite of the newfound financial sobriety in most technology companies today, many are still falling victim to making decisions based on buzz rather than actual technical specification. And since customers are sensitive to buzz, it's sometimes better to actually buy a well marketed product than one that performs well. This leads me to one thing I've learned over the past ten years of working at startups: many corporate executives aren't really interested in technology as much as image.

This rant does have a point, and I do want to address some of the many reasons large corporations should be considering open source solutions - especially the many superior open-source solutions that are available for eliminating spam and light years ahead of commercial solutions. I'm by no means against free enterprise - I in fact would love to see a few anti-spam startups get out there and market some truly statistical, adaptive products that have a chance at solving the spam problem (Death2Spam is one such product, I hear). What I do take issue with is that a legitimate company ought to have a viable product. How you define a viable product is open to some creative license, but certainly part of it must mean that it's better than something you can get for free.

If you're a frustrated employee at a large corporation or Internet service provider and can't get your hands around why others don't get the value of open source, you're not alone. Your managers will probably be hitting you with some questions you may not have even thought about. It seems so obvious to the thinking population why a better functioning open source solution should be preferred over an inferior commercial product, that sometimes we don't even think about the details. The rest of this article is dedicated to explaining why open source solutions especially make sense in the setting of spam filtering.

Cheap Pickup Lines

We've been fed some cheap pickup lines, and most of us have fallen for them at one point. But at the end of the day, free kittens and $25 kittens share two things in common: they both meow and poop. The commercial solution might be justified not by technical specifications but by ROI (as I said, corporate America isn't interested in technology).

The rational question for the technologically challenged, and where some of the ammunition resides in justifying open source, is in return on investment. Does filtering have a better return on investment than managing spam? Does a commercial product have a better ROI than an open-source solution? Some of the common questions non-technical leadership will likely have are provided in the next section and will be crucial for pushing some CEOs over the hump of needing an aesthetically pleasing case. We'll also dispel some of the marketing myths and choice pickup lines you'll usually hear from commercial software gigolos trying to sell you their product.

Marketing Pickup Line #1: You need support

Well yes, you do need support. What you don't need is planned support. The difference is that support involves assistance with reasonably complex tasks while planned support involves making the product more difficult than necessary, to facilitate a support contract.

Generally speaking, companies that tout support as a "value-add" are doing so because their product has been designed for difficulty in maintenance, so you'll pay them for the ability to pick up the phone on occasion - most likely about a problem they created themselves by designing a poor product. These support contracts provide a good bit of residual income for large corporations, whose customers pay annually just to keep them on "standby". You need support much like you need aerosol spray - it comes in handy sometimes, but if you stop hanging out by the bathroom you won't need it as much. Not only do many companies provide poor support, but they provide it overseas in India - which is where you'll likely be calling when you have a problem.

In the open-source community, things are very different. The quality of the product is more important than generating a revenue stream from support, as the philosophy under which the project was originally started is likely to be more philanthropic in nature. And since open source projects are started with the expectation that other people will be using them (on a low level), they're usually designed to be understandable by other knowledgeable administrators. Any open-source product worth its time has been both well-written and well-documented to make it fairly easy for a sysadmin to implement and use on their network. If the admin gets stumped, the open-source community supports the growth of two primary areas of support:

Free Support

The open source community likes things free. Large community support forums have been home-grown for many open-source projects. This means implementors will be able to receive free support from experts in the field who are using the software hands-on - actual technical people who speak your native language and have actually seen the product they're supporting. In contrast, the commercial world leaves the poor systems administrator having to go through their sales contacts, a sales engineer, and potentially two or three other people before finding the answers they need - all while paying for their time. All the money that was sunk into a commercial support contract will start to seem like an awful lot as they hold the line for overrated support that will most likely fail to provide any real answers to their problems (judging by the poor quality of customer service in today's technology marketplace). In the time it takes to reach a technical specialist about some products, they could have already had an answer from a community support forum or chatroom.

Support Contracts

As the popularity of a project grows, so does the number of people looking to earn a living supporting it. Open source developers have come to realize that corporations require support, and many have accommodated this requirement by providing paid support options. These are sometimes available directly through the spam filter author, or by others closely related to the project. This does several things - it promotes healthy competition among these groups, which helps keep support costs down, and it also means that if you stump your support group with a project, there are many other options available, as opposed to commercial support which lends itself to the cookie-cutter approach. Diverse support also means that there is a stronger likelihood you will find a group specializing in your specific area of interest (such as implementing product X on Solaris with an Oracle backend).

A Support Monoculture

Commercial software creates somewhat of a paid support monoculture. All the support you're going to receive generally emanates directly from the software manufacturer or, if they are large enough, from a value-added reseller who has trained their staff with the same learning materials. In other words, if someone can't solve your problem in the commercial world, nobody can. In the open-source world, there's a very diverse set of paid support options available from professional services consultants who specialize in open source and who all draw on different learning experiences.

There are bright, hard-working individuals in the open source industry who are so hands-on that they can probably solve your problem within a fraction of the time it would take a group of mediocre corporate support specialists. Bugs get fixed quicker, people respond faster, and all this at rock bottom prices. If you're thinking about the need for support, hook up with an open source support provider as you prepare to deploy the project on your network. If the project is any good, you shouldn't need nearly as much support as the marketing executives of the corporate world would have you believe.

Marketing Pickup Line #2: You need training

If vendors haven't managed to convince you that open source is a failure because there is "no support", the next thing they'll try to sell you on is that there is "no training" available. Training options in the open source world are very similar to support options. Professional consultants can provide whatever training is needed for whatever projects it is needed for. What's more important to consider when you're talking about such a tool as spam filters is why you need much training in the first place. An anti-spam solution should be simple to use - otherwise your customers won't use it, and you just wasted all your money on a commercial product - with great training and support - that nobody's interested in using. If you have to train "Grandma" how to use your spam solution, you're doing two things wrong. First, you're implementing a solution that's too complex which will not be used by many customers, and more importantly you're kissing any chance of a return on investment goodbye by spending all that extra money to provide technical support to the ones who call in with questions.

If you work for the average American corporation, you'll find it difficult to capitalize on technical support because customers demand it free. In this case, you're probably already looking for ways to make support less expensive. Your call centers may even be outsourced overseas, or filled with bottom-wage employees who have mastered the art of getting people off the phone without actually helping them. Every dime you spend teaching your users how to use the software is money you would have otherwise saved. If your solution requires a lot of end-user training, it may possibly end up costing more than managing the spam problem in the first place.

If a software vendor is touting training as a key selling point, this only means their software is so complex that you need training to use it. Next time one of their vendors tries to schmooze you and raises this point, ask them why you need training in the first place, if their product is so easy-to-use. If it's not easy-to-use, why would you want one?

A majority of open source anti-spam solutions have been designed to be very simple to use, even for Grandma. If a spam makes it past the filter, the user can forward or bounce the message in, click a link, or perform some other trivial task to train the filter. This also provides a sense of participation, which is something a lot of users want in today's world of privacy rights and service control. Not only do most solutions provide an easy-to-use interface like this, but they've been designed flexible enough for systems administrators to implement custom installs. Proprietary systems running IMAP or web-based email can easily be configured with a "Spam Folder" or some other type of device to make managing spam brain-dead easy for the user.

Marketing Pickup Line #3: You DON'T Need "Training"

The other extreme in touting training is that you don't need any training; that you can just plug the product in and make it work without the user needing to do anything. Steer well clear of these products - they are not true learning products! In many cases, static "out-of-the-box" products push the responsibility of training to the systems administrator or charge an annual subscription fee to keep the filters updated. Not only does this cost more money, but it provides very poor filtering with a high risk of errors, because all the filtering is centered around what someone else (the systems administrator or the software manufacturer) thinks about a user's mail, rather than what the user thinks about their own mail. If a user can't teach the filter their specific email behavior, the filter won't provide an acceptable level of results for the money. The ability for users to provide feedback into the system is important not only in training the software, but also in giving the user a sense of satisfaction that they're able to do something about spam - rather than call the abuse department to complain.

Installing software on your network that's capable of only 95% filtering accuracy (and provides no feedback mechanism) is going to increase the likelihood that a customer will call in to tech support. Knowing such a system exists on the network will inevitably make users more critical of their inbox. Should they receive a single spam, many less savvy users think the filter isn't working correctly, and will call in to be a "good Samaritan" and let you know that they received a spam - they'll most likely want to forward it to an abuse address somewhere where more network traffic will be generated, and a human will have to respond to it. Add to this the livid customers who call to complain about lost email or false positives - users who are waiting on an important email, and call in because they believe the filter ate it, or find an email erroneously marked as spam that they feel is an inconvenience, and therefore want to make it tech support's inconvenience.

Lack of a feedback loop does more than hurt accuracy. It costs money. Be very wary of products touting the ability to perform without user participation.

Marketing Pickup Line #4: Commercial solutions are more scalable

Some commercial applications are scalable, but more aren't. In most cases, commercial solutions are bloated with non-statistical components that aren't necessary to good filtering. A lot of individuals buy these tools because they don't want to train all of their users' filters - a justifiable need. It's important to realize, however, that there are alternatives to the complete training of a statistical solution such as global seeding, merged groups, and other approaches that provide almost out-of-the-box filtering with little effort. Your mileage may vary in the scalability of open source projects, but there are at least a few whose execution time is measured in hundredths of a second.

The DSPAM agent runs with a very low execution time, between 0.01s and 0.03s for classification and between 0.03s and 0.10s for training, measured in actual real time on average desktop hardware. The CRM114 discriminator is similarly fast, performing classification in between 0.05s and 0.10s. Plenty of open source tools outperform even the most expensive commercial products on the market. That's not to say a commercial product isn't capable of performing well, but commercial products are certainly not more scalable than what's freely available. Many open source projects have been deployed on systems with several hundred thousand users - there's no justification to suggest that a commercial application could do any better.
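
To put those execution times in perspective, here is a back-of-the-envelope calculation. It assumes each worker process classifies one message at a time and ignores delivery, queueing, and database contention, so treat the result as an upper bound; the function name and the `workers` parameter are hypothetical.

```python
def messages_per_hour(seconds_per_message, workers=1):
    """Upper bound on throughput if each worker handles one message at a time."""
    return workers * 3600 / seconds_per_message

# Classification times quoted above for the DSPAM agent (seconds per message).
fastest, slowest = 0.01, 0.03

best_case = messages_per_hour(fastest)
worst_case = messages_per_hour(slowest)
print(f"one worker: about {worst_case:,.0f} to {best_case:,.0f} messages per hour")
```

Even the slow end of that range leaves plenty of headroom for a single machine before any thought is given to adding more workers.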

In fact, when corporations begin to scale to this many users, there is usually a dramatic cost difference between commercial and open-source implementations. Even a scalable commercial implementation will generally cost more to implement in licensing and support contracts than an open-source solution.

Marketing Pickup Line #5: Commercial solutions are more accurate

Trust no one. For this specific area of technology, quite the contrary is true. In the setting of spam filtering, a majority of commercial solutions today are advertising levels of accuracy from last-generation filtering - somewhere between 95% and 99%. This means roughly between one and five errors per hundred emails! There are a few that tout five-nines accuracy, but many of them are just flat out lying, or require your users to manage whitelists or challenge/response mechanisms. Users of one particular filter making this claim (rhymes with blightmail) have reported filtering rates falling as low as the mid-80s without whitelisting. Well-written open-source filters have achieved rates of 99.5% to 99.9% and beyond with little effort. This means between one and five errors per thousand emails. That's right, they make roughly one-tenth as many errors as commercial solutions! A few open source filters have even managed to achieve close to five-nines accuracy, with the highest peak recorded at 99.991% using purely statistical methods of filtering.

The problem with the industry today is that these numbers are getting thrown around enough to confuse unsuspecting managers who flunked math in high school. Is there really a difference between 99% and 99.9%? A big one! (10/1000 spams vs. 1/1000). Unfortunately, people seem to forget how to do math rather quickly when in the presence of a pretty server.
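
For anyone who wants to check the math, here is a small sketch that turns an advertised accuracy into an expected number of errors. The accuracy figures are the ones quoted in this section; the 100,000-message volume and the function name are arbitrary choices for illustration.

```python
from decimal import Decimal

def expected_errors(accuracy_percent, messages):
    """Expected number of misclassified messages at a given accuracy."""
    miss_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return miss_rate * messages

# Accuracies are passed as strings so Decimal keeps them exact.
for accuracy in ("95", "99", "99.5", "99.9", "99.991"):
    errors = expected_errors(accuracy, 100_000)
    print(f"{accuracy}% accurate: about {float(errors):g} errors per 100,000 messages")
```

Each extra nine cuts the errors by a factor of ten, which is why 99% and 99.9% are not nearly the same thing once real mail volumes are involved.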

If you're measuring ROI, accuracy is crucial. Inaccuracy costs a company money. Money translates to bandwidth, server resources and people time to answer complaints or manage spam. And if your filter performs too poorly, filtering itself might be so useless that you still have to delete mail in chunks. There's a significant loss of productivity in the users who have to delete the spam (something that's important if you're paying these people to do something). An error prone solution will cost:

  • Money for the initial purchase and installation of the equipment

  • The additional bandwidth to cover tens of thousands of extra spams

  • The additional server resources to cover tens of thousands of extra messages

  • Several hours of total productivity to delete spam

  • Loss of productivity for loss of legitimate mail

  • Additional salaries paid to cover increasing tech support expenses

Think about the total amount of money spent on resources, phone calls, and time and you'll see that the price of paying for an error-prone solution only capable of achieving 99% is really far more than the sticker price. Inaccuracy costs more than accuracy. Solutions are available which cost considerably less to implement and provide higher levels of accuracy, or rather lower levels of inaccuracy.

The Death of Old Technology

As I mentioned, most commercial anti-spam solution manufacturers are still using the old heuristic approach to filtering in order to generate monthly revenue, and that's unfortunately giving the spam filtering space the image of a snake-oil salesman. Spamming on behalf of anti-spam solutions doesn't help perception either. As these commercial products age, the annual subscription keeps their customers paying for what would otherwise become an entirely useless product. Most companies are willing to pay an annual subscription to maintain the status quo of the technology industry - we all pay support and maintenance. People don't expect anything more because most applications are static and require babysitting.

As we move into the world of AI, we've opened up a very dangerous can of worms to this standardized way of doing business - or a very refreshing one. Our AI tools are capable of learning how to improve, and actually do their job better as the software becomes older. The danger to monthly residual is that these tools could sit on a network for five years collecting dust - and still perform better than the latest model heuristic filters out-of-the-box within a few weeks' time. This is something to be very concerned about if you're selling the obsolete technology most manufacturers are selling today, but it's also very comforting for the few who understand the vision behind this AI technology and are forming business models around it.

Preventing a Mono-Culture

There are a lot of different filtering appliances out there, and this is fortunate in that it helps to prevent a monoculture. That's not to say that any of these companies wouldn't like to be on top. The problem many companies are beginning to encounter is that because their appliances are fairly static, spammers are becoming some of their customers, and using their machines to test how well their messages will get around the filter. It's pretty easy to change a spam around if you're able to run it through the target filter every time until it finally gets accepted, and with a dumbed-down Bayesian element this is much easier. The adaptive learning provided by true statistical filters is the only solution to this, and makes this practice impossible. It's extremely difficult for a spammer to construct a message that will circumvent a large number of Bayesian filters. This is because each user's filter is based on the user's own personalized behavior. There are plenty of dirty little tricks spammers have tried to use in the past to circumvent filters and they only appear to work on these older heuristic code bases - our adaptive learning filters are seeing right through their tricks. On top of this, open source filters have the added advantage of being successful while also being completely exposed. With their full source code available, you can be certain that any spammer who wants to circumvent an open source filter would be looking for loopholes in the code. If any exist, they're quickly discovered and patched. Because open source projects are commonly community-based efforts, they have the advantage of a large-scale, multitalented development group that is motivated by creativity, rather than salary.

Advanced algorithms such as Bayesian Noise Reduction make it computationally infeasible to perform many of the more advanced attacks. Spam is ever-changing, and that yearly check sent in to the spam filter manufacturer is only a leash. Statistical filtering gets better the more you use it. The biggest fear of these present-day filter manufacturers is that someday it will begin to catch on that there are other (free) solutions out there that get better with time. It's easy to see why there are so many companies out there using buzzwords and avoiding statistical filtering - because they lose their leash.

Maintenance

Finally, maintenance stands the chance of hammering the final nail into the coffin of heuristic filtering. Maintenance of statistical filters is very different from maintenance of the heuristic filters of yesterday. Heuristic filters demand the attention of the systems administrators or a monthly subscription for automatic updates (spam detection rules coming from complete strangers); frequent updates must be installed or transmitted to counter the dynamic nature of spam with new rules. This is ideal if you plan on having a dedicated anti-spam administrator (or a group of them), but most companies don't want to spend an extra hundred thousand dollars on additional employees just to support the so-called "solution" that can't really perform very well on its own anyway. Why have one man doing 100% of the work when you can have all of your customers doing 1/10th of a percent of the work? Not to say that each user must train from scratch, as many statistical tools allow for a global database to start all their users off initially, but forwarding an occasional spam into the system is certainly not what you want to be paying your employees to do. Distributing the responsibility out to the end-user does two things. First, it frees up the staff to work on other projects (nobody wants a dedicated spam guy, especially the guy who's appointed as the dedicated spam guy). It also prevents a total stranger from making decisions about what your filter thinks is spam. Second, it makes each user responsible for their own filtering. Users who don't want to filter themselves merely don't participate. Users who diligently mark spams and correct errors are rewarded with more accurate filtering. This will please the many users who have censorship issues by allowing them to censor themselves. Users want to feel like part of the solution; they have an inner urge to forward the spams they receive somewhere. Why not take advantage of this and allow them to participate?
For large implementations where this is not possible, the global database concept works - set up a global database to provide out-of-the-box filtering, and let your users customize the filter's behavior by occasional training.
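
One way to picture the global database idea is as a fallback: use the shared counts until a user has built up enough history of their own. The sketch below is a simplified assumption of how such a lookup might work, not the implementation used by DSPAM or any other filter; the function name, the `(spam_hits, ham_hits)` layout, and the `min_user_hits` threshold are all hypothetical.

```python
def effective_counts(token, user_db, global_db, min_user_hits=5):
    """Return (spam_hits, ham_hits) for a token, leaning on the global
    database until the user's own history is large enough to stand alone."""
    user_spam, user_ham = user_db.get(token, (0, 0))
    if user_spam + user_ham >= min_user_hits:
        return user_spam, user_ham            # the user's own behavior wins
    global_spam, global_ham = global_db.get(token, (0, 0))
    return user_spam + global_spam, user_ham + global_ham

# Hypothetical data: the shared database thinks 'mortgage' is mostly spam.
global_db = {"mortgage": (80, 5), "invoice": (10, 60)}
new_user = {}                                 # no training yet
broker = {"mortgage": (0, 12)}                # a user who handles mortgages all day

fresh = effective_counts("mortgage", new_user, global_db)
personal = effective_counts("mortgage", broker, global_db)
print(f"new user sees {fresh}, trained user sees {personal}")
```

A brand-new user gets sensible filtering on day one, while a user whose legitimate mail looks unusual stops being penalized as soon as they have trained a handful of messages.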

Final Thoughts

The consistent fear manufacturers have is that AI makes decisions for people, so that you don't need people in the loop. In reality, AI does make decisions for people, but not the important ones. Why should people have to devote their time to determining if messages are spam? Why should support groups be necessary to answer first-level requests for information? A lot of companies are scared of AI, and with good reason. The companies who are manufacturing tools that don't adapt or help learn how to make decisions will eventually be left by the wayside if they don't change. There will be a time of adaptation to this new technology, but like setting fire to a field, what gets burned away will be replaced with something much better. AI is here to stay, and has been mastered by the open source community.




 All Website Content © 2005 Jonathan A. Zdziarski. All Rights Reserved.
Reproduction prohibited without permission