I define spam as email which is both unsolicited and sent in bulk. By
unsolicited, I mean that the sender does not have permission from the
recipient to send them email (we can quibble about how much permission
is implied by existing business relationships, but I won't go into that
here). By bulk email, I mean substantially similar messages sent to a
large number of recipients. By this definition, spam includes not only the
perennial get-rich-quick and penis enlargement schemes, but also
unsolicited bulk political or religious messages, or badly run mailing lists
where some recipients were signed up without their permission.
For the purpose of this article, a filter is software which prevents the
intended recipient from seeing the spam. When we're comparing filters,
we talk in terms of the numbers of false positives (that is,
non-spam which gets filtered) and false negatives (that is, spam
which is not filtered). Most people, unless they're drowning in spam,
consider false positives much worse than false negatives. A simple way
for individual users to eliminate the most serious false positives is to
place a white listing system in front of any filters, so that
mail from people you know always gets through.
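To make these terms concrete, here is a minimal Python sketch of a white list sitting in front of a filter, with a helper which counts both kinds of error on a labelled set of mail. The function names and the (sender, body, is_spam) tuples are assumptions made for illustration; they don't come from any real filter.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical message representation: (sender, body, is_spam).
Message = Tuple[str, str, bool]


def make_pipeline(whitelist: Set[str],
                  spam_filter: Callable[[str], bool]) -> Callable[[str, str], bool]:
    """Put a white list (of lower-case addresses) in front of a filter.

    The returned function says whether a message would be filtered.
    """
    def is_filtered(sender: str, body: str) -> bool:
        if sender.lower() in whitelist:
            return False  # known correspondents bypass the filter entirely
        return spam_filter(body)
    return is_filtered


def count_errors(messages: Iterable[Message],
                 is_filtered: Callable[[str, str], bool]) -> Tuple[int, int]:
    """Return (false positives, false negatives) for labelled mail."""
    false_pos = false_neg = 0
    for sender, body, is_spam in messages:
        filtered = is_filtered(sender, body)
        if filtered and not is_spam:
            false_pos += 1  # non-spam which got filtered
        elif is_spam and not filtered:
            false_neg += 1  # spam which was not filtered
    return false_pos, false_neg
```

Note that the white list trusts the sender address as given, so it only removes false positives for people who mail you from an address you already know.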
I won't be talking about what you can do to avoid getting spammed in the
first place, about legal remedies for spam, or about revisions to
the email system itself.
Fixing the holes in the system
A computer connected to the Internet which acts as an SMTP relay to
anyone, or which will proxy TCP/IP connections for anyone, provides the
spammer with a way to hide his own IP address, avoiding complaints to
his ISP. With SMTP relays, the spammer can hand off the job of
delivering the mail to the relay itself (since SMTP allows the
spammer to specify multiple recipients across many domains for a single
message body). Years ago, the main hole used by spammers was the open
SMTP relay. Now, many machines are permanently on broadband connections
with badly configured connection sharing software (such as SOCKS or HTTP
proxies), and spammers are abusing these open proxies more and more.
Both types of hole can be dealt with by blacklists of the IP addresses
of the insecure machines. The operators of these blacklists will test
suspected open machines and publish the IP addresses of those they find
to be open via the DNS. They might take public submissions for addresses
to test, or may test machines which connect to them and attempt to send
email.
Testing itself is controversial since it is arguably behaving like the
spammers themselves: it's not clear how someone receiving a proxy or
relay probe can tell that it doesn't originate from yet another spammer.
That said, these blacklists are much less controversial than those which
list based on human decision rather than automated testing (more of
which below). It is a common
mistake to confuse the automated lists with those based on human
opinion, since they are both distributed via the DNS.
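For readers who haven't seen how a list is published via the DNS, the usual convention is to reverse the octets of the IP address, append the list's zone, and look the resulting name up: if the name resolves, the address is listed. Here is a small Python sketch of that lookup. The zone `dnsbl.example.org` is a placeholder rather than a real list, and the `resolve` parameter is an assumption added so the logic can be tried without touching the network.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNS-based list."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """True if the list publishes a record for this address.

    Any lookup failure is treated as "not listed", which is a simplification:
    a real client would want to tell a timeout from a genuine negative answer.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` gives `99.2.0.192.dnsbl.example.org`.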
Most of the big Unix mail transports (like Sendmail or Exim) support
using DNS-based blacklists to reject connections or tag suspect
mail. On the Windows desktop, you can use a free tool like SpamPal to
check the headers of your mail for blacklisted IP addresses. Blacklists
of open machines include ORDB (for open relays) and BOPM (for open
proxies).
Such schemes can be expected to have a low false positive rate (although
higher if you do business with small companies who may have incompetent
sysadmins), and a moderate false negative rate. Unfortunately,
blacklisting open machines won't completely deal with spam. For one
thing, there are just too many open proxies, so the lists are always a
little behind the times.
Recognising spam by the sender (or their ISP)
Some spammers send from their own IP addresses, or just send from ISPs
which don't care about spam. As well as allowing spammers to send email,
ISPs which aren't clued up about spam may host spammers' websites or
provide other spam
support services, arguing that because they're not allowing the
spammer to spam from their own IP addresses, they're not doing anything
wrong. These days, this argument doesn't wash. Providing a stable
website for a spammer is aiding and abetting spam.
There are DNS-based blacklists which will list IP addresses belonging to
spammers and to spam-friendly ISPs. This involves human judgement as to
whether an IP range belongs to a spammer, and whether an ISP is doing
enough to deal with the spammers they host. The original example of this
sort of blacklist was the MAPS RBL. These days, after the RBL started
charging for use and was forced into climbdowns by legal action, other
similar lists have sprung up, from the relatively conservative SBL to the
very enthusiastic SPEWS.
While this type of blacklist is often maligned for being overly broad,
the false positive rate very much depends on the blacklist you use:
SPEWS's policy of gradually expanding their listings to cover the IP
space of unresponsive ISPs can be expected to generate many more false
positives than the SBL.
People wanting to use these blacklists can use the same software as
you'd use for open relay or proxy lists, as these are all DNS-based
lists. Because such blacklists will list for "spam support", it's also
worth filtering on URLs in the message body by looking up the IP of the
corresponding host and checking that against the blacklist (disclosure: I
have written free software which does this).
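As a rough illustration of the URL check (a sketch only, not a description of any particular tool), the Python below pulls host names out of a message body, resolves each one, and asks a blacklist about the resulting address. The `resolve` and `is_listed` callables are assumptions: they stand in for a DNS lookup and for whatever blacklist query you use.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) URLs up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(body):
    """Return the set of host names mentioned in URLs in the body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            hosts.add(host)
    return hosts


def body_links_to_listed_host(body, resolve, is_listed):
    """True if any URL in the body points at a blacklisted IP address.

    resolve(host) returns an IP address string (raising OSError on failure);
    is_listed(ip) says whether the blacklist contains that address.
    """
    for host in sorted(url_hosts(body)):
        try:
            ip = resolve(host)
        except OSError:
            continue  # host doesn't resolve, so there's nothing to check
        if is_listed(ip):
            return True
    return False
```

In practice `resolve` could be `socket.gethostbyname`, with `is_listed` wrapping a query against the blacklist of your choice.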
Another approach is to recognise known senders and force the people who
have not mailed you before to prove they're humans rather than computer
programs. Using programs like TMDA,
mail from unrecognised senders is held in a "jail", and the sender gets
an automated response which asks them to click on a link or send another
response to get their initial mail out of jail. While this approach has
a very low false negative rate, it wouldn't work if everyone did it (each
side's challenge would itself be held waiting for a reply to a challenge),
and it may cause some of your correspondents to just give up without
confirming that they are human, effectively causing false positives.
Because this approach trusts the sender address provided in the headers
and sends many substantially identical messages
itself, it can have unfortunate
interactions with other filtering systems. It's unlikely that a
business could get away with using this method.
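The mechanics of such a challenge-response system can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how TMDA itself is implemented: the class and method names are invented for illustration, and the "jail" is just a dictionary in memory.

```python
import secrets


class ChallengeResponseGate:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders):
        self.known = {s.lower() for s in known_senders}
        self.jail = {}  # token -> (sender, message)

    def receive(self, sender, message):
        """Return ("deliver", message) or ("challenge", token)."""
        if sender.lower() in self.known:
            return ("deliver", message)
        # Unknown sender: hold the mail and issue an unguessable token, which
        # would be mailed back to the (unverified) sender address.
        token = secrets.token_urlsafe(16)
        self.jail[token] = (sender.lower(), message)
        return ("challenge", token)

    def confirm(self, token):
        """Release the held message for a valid token, else return None."""
        held = self.jail.pop(token, None)
        if held is None:
            return None
        sender, message = held
        self.known.add(sender)  # future mail from this sender gets through
        return message
```

The sketch makes the weakness visible: the challenge goes to whatever address the headers claim, so forged senders receive challenges they never asked for.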
Recognising spam by its content
A lot of spam advertises penis enlargement, porn, Viagra, pyramid
schemes and so on. Some filters work by recognising the content of the
message as falling into one of these types.
The simplest approach of all is to look for key words in the message
body, such as "Viagra" or "Nigeria". If we find one, we consider the
message to be spam. Unfortunately, this causes
much legitimate email to be rejected because it merely mentions one
of the "bad words". This method has a stupidly high false positive rate
(and a reasonably high false negative rate once the spammers learn not
to use certain words), but despite that, the complaints from mailing
list operators would indicate that many expensive "enterprise" solutions
seem to implement some sort of keyword system. Beware of snake oil.
If we want to get a bit more sophisticated, we can combine keywords and
phrases with some sort of scoring system, so that it takes more than a
single mention of a bad word to get a message filtered. By adjusting the
scores, we can filter spam pretty effectively. SpamAssassin works in this
way (as well as incorporating blacklist checks and just about every
other filtering technique mentioned here), and is reported to have a low
false positive and low false negative rate, provided you keep up to date
with the latest versions containing the latest key phrases.
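The difference between the two approaches shows up even in a toy example. In the Python sketch below, the phrases, weights and threshold are all invented for illustration (they are not taken from any real filter): the plain keyword filter rejects a legitimate message which happens to mention Nigeria, while the scoring filter lets it through and still catches the spam.

```python
import re

# Hypothetical phrases and weights, made up for this example.
SCORES = {
    "viagra": 2.0,
    "nigeria": 1.5,
    "get rich quick": 3.0,
    "click here": 1.0,
    "unsubscribe": 0.5,
}
THRESHOLD = 3.0  # assumed cut-off: total score at or above this is filtered


def keyword_filter(body):
    """Naive approach: a single bad word or phrase is enough."""
    text = body.lower()
    return any(phrase in text for phrase in SCORES)


def spam_score(body):
    """Sum the weights of every listed phrase which appears in the body."""
    text = re.sub(r"\s+", " ", body.lower())  # collapse runs of whitespace
    return sum(weight for phrase, weight in SCORES.items() if phrase in text)


def scoring_filter(body):
    """Scoring approach: filter only when enough evidence accumulates."""
    return spam_score(body) >= THRESHOLD
```

Tuning then becomes a matter of adjusting the weights and the threshold, which is exactly the part that has to be kept up to date as spammers change their wording.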
Following the idea of a scoring system to its conclusion, we can use
machine learning techniques to train the scoring system on examples of
spam and non-spam email, allowing it to adjust its own scores. Bayesian
inference is a popular learning technique at the moment, with programs
like Popfile, Spambayes, and many others providing free Bayesian
filters. Bayesian filters are reputed to be very effective, with low
false positive and false negative rates.
That said, Jeremy Bowers argues persuasively that human malice can always
defeat such automated classification, and will do so eventually. Some
spammers are already avoiding the key phrases which SpamAssassin looks
for.
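To show what "training the scores" means, here is a minimal naive Bayes classifier in Python. It is a textbook sketch under simplifying assumptions (words are treated as independent, only the words present in a message are considered, and add-one smoothing is used); it is not the algorithm used by Popfile or Spambayes, and the class and method names are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    """The set of distinct lower-cased words in a message."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class NaiveBayesFilter:
    def __init__(self):
        self.messages = {"spam": 0, "ham": 0}               # messages seen per class
        self.words = {"spam": Counter(), "ham": Counter()}  # messages containing each word

    def train(self, text, label):
        """Learn from one example; label is "spam" or "ham"."""
        self.messages[label] += 1
        self.words[label].update(tokens(text))

    def _log_score(self, found, label):
        total = sum(self.messages.values())
        seen = self.messages[label]
        # Add-one smoothing keeps unseen words from forcing a probability of zero.
        score = math.log((seen + 1) / (total + 2))
        for word in found:
            score += math.log((self.words[label][word] + 1) / (seen + 2))
        return score

    def spam_probability(self, text):
        """Estimated probability that the text is spam, between 0 and 1."""
        found = tokens(text)
        diff = self._log_score(found, "ham") - self._log_score(found, "spam")
        diff = max(-700.0, min(700.0, diff))  # keep math.exp within float range
        return 1.0 / (1.0 + math.exp(diff))
```

Each call to `train` shifts the per-word estimates, so the "scores" are learned from your own mail rather than written by hand. That is also why a spammer who can guess what your legitimate mail looks like has something to aim at.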
Recognising spam by its bulkiness
A neat objection to all of these techniques is that they're not actually
filtering based on the properties of spam I mentioned earlier, namely
mail that is unsolicited and bulk. If we can directly detect bulk email,
and white list all solicited sources of bulk email (such as
mailing lists we signed up for), the bulk email which remains is spam.
The Distributed Checksum Clearinghouse (DCC) works by taking message digests
(known as hashes or checksums) of all mail passing through a
server (or a set of co-operating servers, hence the "distributed" in the
name). By counting the number of times we've seen a particular digest,
we can tell how many times we've seen that message. Above a certain
number of messages, the email is considered bulk and is either white
listed or filtered out. The digest functions used by these schemes are
constantly changing in an attempt to ignore the "hash busters" and other
personalisations inserted by spammers. Such bulkiness detection schemes
have very low false positive rates, providing you remember to white list
your solicited bulk email, and reasonably low false negative rates.
Unfortunately there's no version of the DCC client for Windows desktops
at the moment, although it
wouldn't be hard to make one.
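The counting idea can be sketched in Python as follows. This is only an illustration of the principle, not the DCC's actual checksums, which are more sophisticated: the crude normalisation (dropping any token containing a digit, then anything which isn't a letter), the threshold, and the sender-based white list are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter


def fuzzy_digest(body):
    """Digest of the body after stripping likely per-recipient variations."""
    text = body.lower()
    text = re.sub(r"\b\w*\d\w*\b", "", text)      # drop tokens containing digits
    text = re.sub(r"[^a-z]+", " ", text).strip()  # keep letters only
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class BulkDetector:
    def __init__(self, threshold, solicited_senders=()):
        self.threshold = threshold  # copies seen before mail counts as bulk
        self.solicited = {s.lower() for s in solicited_senders}
        self.counts = Counter()     # digest -> number of times seen

    def is_unsolicited_bulk(self, sender, body):
        """Count this message and say whether it should be filtered."""
        digest = fuzzy_digest(body)
        self.counts[digest] += 1
        if sender.lower() in self.solicited:
            return False  # white listed bulk, e.g. a list we signed up for
        return self.counts[digest] >= self.threshold
```

With a threshold of 3, the third copy of "Buy now! Ref: ..." is filtered even though each copy carries a different reference code, while the same text arriving from a white listed mailing list is let through.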
Razor, along with its Windows version, Cloudmark Spamnet, is a similar
scheme, but instead of taking digests of all mail, it relies on users
manually reporting the digests of spam (or on scripts which report mail
sent to "spam traps") and has some kind of undisclosed trust or
moderation network to attempt to eliminate malicious reporting (myself,
I'm not sure that can ever work). Brightmail operate a similar,
but non-free, scheme, and various ISPs seem to operate their own similar
schemes behind the scenes.
Ironically, just as Bayesian filters may eventually lead to an arms race
which will end when only a human can tell spam from legitimate email, so
bulkiness detection may eventually produce spam which only a human can
tell is a form letter. We can take some hope from the fact that evading
all these filtering techniques often requires mutually contradictory
responses from the spammers.
The politics of filtering
Filtering only helps those who have filters. Filtering may in fact
reduce the number of useful complaints an ISP will receive about their
spammers, since the skilled Internet users will have good filters and
will only complain about spam which makes it through the filters. So,
the spammer remains connected and can continue to spam people who don't
have filters, who are also likely to be the people most vulnerable to
scams. That said, it's difficult to see what can be done about this. The
Internet is now as full
of idiots as any other public place, and few would argue that
filter users have a responsibility to protect them all.
Filtering reduces the effectiveness of the email system. If my mail
somehow looks like spam, your filter might swallow it, and I may never
know (bouncing rejected mail is problematic,
and some filters won't do it). Some people simply filter messages which
look like spam into a separate folder rather than discarding or bouncing
them, which cuts down on false positives but is effectively just
shifting the time you spend reading spam (though that in itself is a
good thing if you're someone who checks their mail every time new mail
arrives).
Right now there's a choice between accepting this degradation of the
system or accepting that caused by spam. Fortunately, by carefully
choosing your filters, the number of false positives can be kept very
low. That might change in the future, leading to speculations about
alterations to the infrastructure of email. But that's another story,
which I might tell another time.
What I do
I guess people are bound to ask the author of an article like this what
he does about filtering. Right now, I use only the DCC to filter email.
A few spams are slipping through because they're short enough that the
DCC doesn't want to checksum them: such spams mostly contain a few words
and a URL. A highly effective means of filtering these is to filter mail
which contains only an HTML part with no text alternative, as I believe
there are no legitimate mailers which send such mail (even Outlook
Express includes the alternative textual part). I'm also thinking of
adding the check on URL hosts against the SBL, as mentioned above, and
also a check of the headers against open proxy blacklists.
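The HTML-only check is easy to sketch with the `email` package from the Python standard library. This is one way of expressing the rule rather than the filter I actually run, and the function name is invented: a message is flagged if it contains a text/html part but no text/plain part anywhere in its MIME structure.

```python
from email import message_from_string


def is_html_only(raw_message):
    """True if the message has an HTML part but no plain text part."""
    msg = message_from_string(raw_message)
    # Collect the content types of all the leaf (non-container) parts.
    types = {
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    }
    return "text/html" in types and "text/plain" not in types
```

A multipart/alternative message with both parts passes, as does ordinary plain text; only mail which offers HTML and nothing else is caught.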