Kuro5hin.org: technology and culture, from the trenches
Spam filtering: a whistle-stop tour

By pw201 in Internet
Sun Mar 09, 2003 at 05:57:03 PM EST
Tags: Technology

Spam is a problem for most people who use email. Some people believe that it threatens the usefulness of email as a means of personal communication. Filtering, in many different forms, is a possible way of dealing with the spam problem. In this article, we'll take a whistle-stop tour of filtering techniques, and I'll point out some examples of software to do filtering. Wherever possible, I'll mention software which costs nothing and is available for both Unix and Windows.


The problem

I define spam as email which is both unsolicited and sent in bulk. By unsolicited, I mean that the sender does not have permission from the recipient to send them email (we can quibble about how much permission is implied by existing business relationships, but I won't go into that here). By bulk email, I mean substantially similar messages sent to a large number of recipients. By this definition, spam includes not only the perennial get rich quick and penis enlargement schemes, but also unsolicited bulk political or religious messages, and badly run mailing lists where some recipients were signed up without their permission.

For the purpose of this article, a filter is software which prevents the intended recipient from seeing the spam. When we're comparing filters, we talk in terms of the numbers of false positives (that is, non-spam which gets filtered) and false negatives (that is, spam which is not filtered). Most people, unless they're drowning in spam, consider false positives much worse than false negatives. A simple way for individual users to eliminate the most serious false positives is to place a white listing system in front of any filters, so that mail from people you know always gets through.
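To make the whitelist-in-front idea concrete, here's a minimal Python sketch (the address book and the inner filter are stand-ins, not real software):

```python
# Whitelist placed in front of another filter: mail from known
# correspondents always gets through, everyone else is checked.
ADDRESS_BOOK = {"alice@example.org", "bob@example.com"}  # illustrative

def classify(sender, body, looks_like_spam):
    """Return "ham" or "spam"; whitelisted senders bypass the filter."""
    if sender.lower() in ADDRESS_BOOK:
        return "ham"
    return "spam" if looks_like_spam(body) else "ham"

# With a trivial one-keyword stand-in filter, a known correspondent
# can mention Viagra without being filtered; a stranger cannot:
flt = lambda body: "viagra" in body.lower()
print(classify("alice@example.org", "Viagra article attached", flt))   # ham
print(classify("stranger@example.net", "Cheap Viagra!!!", flt))        # spam
```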

I won't be talking about what you can do to avoid getting spammed in the first place, about legal remedies for spam, or about revisions to mail protocols.

Fixing the holes in the system

A computer connected to the Internet which acts as an SMTP relay to anyone, or which will proxy TCP/IP connections for anyone, provides the spammer with a way to hide his own IP address, avoiding complaints to his ISP. With SMTP relays, the spammer can hand off the job of delivering the mail to the relay itself (since SMTP allows the spammer to specify multiple recipients across many domains for a single message body). Years ago, the main hole used by spammers was the open SMTP relay. Now, many machines are permanently on broadband connections with badly configured connection sharing software (such as SOCKS or HTTP proxies), and spammers are abusing these open proxies more and more.

Both types of hole can be dealt with by blacklists of the IP addresses of the insecure machines. The operators of these blacklists will test suspected open machines and publish the IP addresses of those they find to be open via the DNS. They might take public submissions for addresses to test, or may test machines which connect to them and attempt to send email. Testing itself is controversial since it is arguably behaving like the spammers themselves: it's not clear how someone receiving a proxy or relay probe can tell that it doesn't originate from yet another spammer. That said, these blacklists are much less controversial than those which list based on human decision rather than automated testing (more of which below). It is a common mistake to confuse the automated lists with those based on human opinion, since they are both distributed via the DNS.

Most of the big Unix mail transports (like Sendmail or Exim) support using DNS-based blacklists to reject connections or tag suspect mail. On the Windows desktop, you can use a free tool like Spampal to check the headers of your mail for blacklisted IP addresses. Blacklists of open machines include ORDB (for open relays) and BOPM (for open proxies).
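The mechanics of a DNS-based blacklist lookup are simple enough to sketch: reverse the octets of the connecting IP address, append the blacklist's zone, and look the name up; any answer means the address is listed. A Python illustration (the zone name here is a placeholder, not a real blacklist):

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the conventional DNSBL query: reversed octets plus zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="dnsbl.example.org"):
    """True if the blacklist publishes an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # NXDOMAIN: not listed (or the lookup failed)

# 127.0.0.2 is the conventional "always listed" test address:
print(dnsbl_query_name("127.0.0.2", "dnsbl.example.org"))
# 2.0.0.127.dnsbl.example.org
```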

Such schemes can be expected to have a low false positive rate (although higher if you do business with small companies who may have incompetent sysadmins), and a moderate false negative rate. Unfortunately, blacklisting open machines won't completely deal with spam. For one thing, there are just too many open proxies, so the lists are always a little behind the times.

Recognising spam by the sender (or their ISP)

Some spammers send from their own IP addresses, or just send from ISPs which don't care about spam. As well as allowing spammers to send email, ISPs which aren't clued up about spam may host spammers' websites or provide other spam support services, arguing that because they're not allowing the spammer to spam from their own IP addresses, they're not doing anything wrong. These days, this argument doesn't wash. Providing a stable website for a spammer is aiding and abetting spam.

There are DNS-based blacklists which will list IP addresses belonging to spammers and to spam-friendly ISPs. This involves human judgement as to whether an IP range belongs to a spammer, and whether an ISP is doing enough to deal with the spammers they host. The original example of this sort of blacklist was the MAPS RBL. These days, after the RBL started charging for use and was forced into climbdowns by legal action, other similar lists have sprung up, from the relatively conservative SBL to the very enthusiastic SPEWS.

While this type of blacklist is often maligned for being overly broad, the false positive rate very much depends on the blacklist you use: SPEWS's policy of gradually expanding their listings to cover the IP space of unresponsive ISPs can be expected to generate many more false positives than the SBL.

People wanting to use these blacklists can use the same software as you'd use for open relay or proxy lists, as these are all DNS-based lists. Because such blacklists will list for "spam support", it's also worth filtering on URLs in the message body by looking up the IP of the corresponding host and checking that against the blacklist (disclosure: my own free software is behind that link).

Another approach is to recognise known senders and force the people who have not mailed you before to prove they're humans rather than computer programs. Using programs like TMDA, mail from unrecognised senders is held in a "jail", and the sender gets an automated response which asks them to click on a link or send another response to get their initial mail out of jail. While this approach has a very low false negative rate, it wouldn't work if everyone did it, and it may cause some of your correspondents to just give up without confirming that they are human, effectively causing false positives. Because this approach trusts the sender address provided in the headers and sends many substantially identical messages itself, it can have unfortunate interactions with other filtering systems. It's unlikely that a business could get away with using this method.
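The jail-and-confirm flow can be sketched in a few lines of Python; this is a toy model of the TMDA idea rather than TMDA's actual interface, and a real implementation would keep the jail and the confirmed list on disk:

```python
import secrets

confirmed = set()   # senders who have proved they're human
jail = {}           # token -> (sender, message) awaiting confirmation

def receive(sender, message):
    """Deliver mail from confirmed senders; jail everyone else."""
    if sender in confirmed:
        return ("deliver", message)
    token = secrets.token_hex(8)
    jail[token] = (sender, message)
    return ("challenge", "reply or click the link containing token " + token)

def confirm(token):
    """The sender answered the challenge: release the mail and trust them."""
    sender, message = jail.pop(token)
    confirmed.add(sender)
    return ("deliver", message)
```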

Recognising spam by its content

A lot of spam advertises penis enlargement, porn, Viagra, pyramid schemes and so on. Some filters work by recognising the content of the message as falling into one of these types.

The simplest approach of all is to look for key words in the message body, such as "Viagra" or "Nigeria". If we find one, we consider the message to be spam. Unfortunately, this causes much legitimate email to be rejected because it merely mentions one of the "bad words". This method has a stupidly high false positive rate (and a reasonably high false negative rate once the spammers learn not to use certain words), but despite that, the complaints from mailing list operators would indicate that many expensive "enterprise" solutions seem to implement some sort of keyword system. Beware of snake oil.
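For the record, the naive approach really is this simple, which is why its failure mode is so obvious; a Python sketch with an illustrative word list:

```python
BAD_WORDS = {"viagra", "nigeria"}  # illustrative, not a recommendation

def keyword_filter(body):
    """Flag the message if any "bad word" appears anywhere in it."""
    return any(word.strip(".,!?") in BAD_WORDS
               for word in body.lower().split())

print(keyword_filter("Cheap Viagra, act now!"))        # True
print(keyword_filter("My aunt is visiting Nigeria."))  # True: a false positive
```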

If we want to get a bit more sophisticated, we can combine keywords and phrases with some sort of scoring system, so that it takes more than a single mention of a bad word to get a message filtered. By adjusting the scores, we can filter spam pretty effectively. Spam Assassin works in this way (as well as incorporating blacklist checks and just about every other filtering technique mentioned here), and is reported to have a low false positive and low false negative rate, provided you keep up to date with the latest versions containing the latest key phrases.
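A scoring filter is only slightly more code; the rules and weights below are invented for illustration and bear no relation to Spam Assassin's actual rule set:

```python
import re

RULES = [                      # (pattern, score) pairs - all invented
    (re.compile(r"viagra", re.I), 2.5),
    (re.compile(r"100% free", re.I), 1.5),
    (re.compile(r"click here", re.I), 1.0),
]
THRESHOLD = 3.0                # one bad word alone is not enough

def score(body):
    return sum(points for pattern, points in RULES if pattern.search(body))

def is_spam(body):
    return score(body) >= THRESHOLD

print(is_spam("A medical article mentioning Viagra"))  # False: 2.5 < 3.0
print(is_spam("100% FREE Viagra! Click here"))         # True: 5.0
```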

Following the idea of a scoring system to its conclusion, we can use machine learning techniques to train the scoring system on examples of spam and non-spam email, allowing it to adjust its own scores. Bayesian inference is a popular learning technique at the moment, with programs like Popfile , Spambayes, and many others providing free Bayesian filters. Bayesian filters are reputed to be very effective, with low false positive and false negative rates.
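At its core, a word-level Bayesian filter is short enough to sketch; this toy version (add-one smoothing, and none of the tokenisation or word-selection tricks real filters use) shows the train/classify cycle:

```python
import math
from collections import Counter

class NaiveBayes:
    """Toy word-level spam classifier; real filters add many refinements."""
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.msgs[label] += 1
        self.words[label].update(text.lower().split())

    def classify(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.words[label].values())
            # log P(label) + sum of log P(word | label), add-one smoothed
            s = math.log(self.msgs[label] / sum(self.msgs.values()))
            for w in text.lower().split():
                s += math.log((self.words[label][w] + 1) / (total + vocab))
            scores[label] = s
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("spam", "cheap viagra free money")
nb.train("ham", "meeting agenda for monday")
print(nb.classify("free viagra"))  # spam
```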

That said, Jeremy Bowers argues persuasively that human malice can always defeat such automated classification, and will do so eventually. Some spammers are already avoiding the key phrases which Spam Assassin looks for.

Recognising spam by its bulkiness

A neat objection to all of these techniques is that they're not actually filtering based on the properties of spam I mentioned earlier, namely mail that is unsolicited and bulk. If we can directly detect bulk email, and white list all solicited sources of bulk email (such as mailing lists we signed up for), the bulk email which remains is spam, by definition.

The Distributed Checksum Clearinghouse works by taking message digests (known as hashes or checksums) of all mail passing through a server (or a set of co-operating servers, hence the "distributed" in the name). By counting the number of times we've seen a particular digest, we can tell how many times we've seen that message. Above a certain number of messages, the email is considered bulk and is either white listed or filtered out. The digest functions used by these schemes are constantly changing in an attempt to ignore the "hash busters" and other personalisations inserted by spammers. Such bulkiness detection schemes have very low false positive rates, providing you remember to white list your solicited bulk email, and reasonably low false negative rates.
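The counting scheme is easy to sketch; the normalisation below (lowercasing, collapsing whitespace, dropping digits) is a crude stand-in for the fuzzy digests the DCC actually uses:

```python
import hashlib
import re
from collections import Counter

seen = Counter()        # digest -> how many copies we've seen
BULK_THRESHOLD = 10     # copies before a message counts as bulk

def fuzzy_digest(body):
    """Normalise away easy personalisations, then hash what's left."""
    text = re.sub(r"\d+", "", body.lower())      # drop "hash buster" digits
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return hashlib.sha256(text.encode()).hexdigest()

def is_bulk(body):
    digest = fuzzy_digest(body)
    seen[digest] += 1
    return seen[digest] >= BULK_THRESHOLD

# Two "personalised" copies collapse to the same digest:
print(fuzzy_digest("Dear customer 1017") == fuzzy_digest("Dear  CUSTOMER 2942"))  # True
```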

Unfortunately there's no version of the DCC client for Windows desktops at the moment, although it wouldn't be hard to make one.

Razor (and its Windows version, Cloudmark Spamnet) is a similar scheme, but instead of taking digests of all mail, it relies on users manually reporting the digests of spam (or on scripts which report mail sent to "spam traps") and has some kind of undisclosed trust or moderation network to attempt to eliminate malicious reporting (myself, I'm not sure that can ever work). Brightmail operate a similar, but non-free, scheme, and various ISPs seem to operate their own similar schemes behind the scenes.

Ironically, just as Bayesian filters may eventually lead to an arms race which will end when only a human can tell spam from legitimate email, so bulkiness detection may eventually produce spam which only a human can tell is a form letter. We can take some hope from the fact that evading all these filtering techniques often requires mutually contradictory responses from the spammers.

The politics of filtering

Filtering only helps those who have filters. Filtering may in fact reduce the number of useful complaints an ISP will receive about their spammers, since the skilled Internet users will have good filters and will only complain about spam which makes it through the filters. So, the spammer remains connected and can continue to spam people who don't have filters, who are also likely to be the people most vulnerable to scams. That said, it's difficult to see what can be done about this. The Internet is now as full of idiots as any other public place, and few would argue that filter users have a responsibility to protect them all.

Filtering reduces the effectiveness of the email system. If my mail somehow looks like spam, your filter might swallow it, and I may never know (bouncing rejected mail is problematic, and some filters won't do it). Some people simply filter messages which look like spam into a separate folder rather than discarding or bouncing them, which softens the cost of false positives but effectively just shifts the time you spend reading spam (though that shift is itself a good thing if you're someone who checks their mail every time new mail arrives).

Right now there's a choice between accepting this degradation of the system or accepting that caused by spam. Fortunately, by carefully choosing your filters, the number of false positives can be kept very low. That might change in the future, leading to speculations about alterations to the infrastructure of email. But that's another story, which I might tell another time.

What I do

I guess people are bound to ask the author of an article like this what he does about filtering. Right now, I use only the DCC to filter email. A few spams are slipping through because they're short enough that the DCC doesn't want to checksum them: such spams mostly contain a few words and a URL. A highly effective means of filtering these is to filter mail which contains only an HTML part with no text alternative, as I believe there are no legitimate mailers which send such mail (even Outlook Express includes the alternative textual part). I'm also thinking of adding the check on URL hosts against the SBL, as mentioned above, and also a check of the headers against open proxy blacklists.
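The HTML-only test is straightforward with Python's standard email module; this sketch flags any message which carries a text/html part but no text/plain alternative:

```python
from email import message_from_string

def html_only(raw):
    """True if the message has HTML but no text/plain alternative."""
    msg = message_from_string(raw)
    types = [part.get_content_type()
             for part in msg.walk() if not part.is_multipart()]
    return "text/html" in types and "text/plain" not in types

raw = ("From: sender@example.com\n"
       "Content-Type: text/html\n"
       "\n"
       "<a href='http://example.com/'>click</a>\n")
print(html_only(raw))  # True
```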

Spam filtering: a whistle-stop tour | 52 comments (46 topical, 6 editorial, 0 hidden)
How true it is! (5.00 / 2) (#2)
by acceleriter on Sun Mar 09, 2003 at 10:40:40 AM EST

Testing itself is controversial since it is arguably behaving like the spammers themselves: it's not clear how someone receiving a proxy or relay probe can tell that it doesn't originate from yet another spammer.

I inadvertently reported a message from myself to SpamCop (yeah, I know, I'm an idiot--it was to an address which is normally only a spam trap), and, not having seen the cause and effect, reported the portscan from wirehub.nl. Now I don't normally bother reporting portscans, lest I be perceived by network admins as a BlackIce-wielding loser wannabe. But this was over four hours, and tried every damn port--not just your standard 1080, 3128, 137 from China.

The SMTP relay tests, though, were more obvious in purpose, and didn't end up freaking me out.

Nice article, BTW.

that reminds me (3.25 / 4) (#3)
by VoxLobster on Sun Mar 09, 2003 at 12:03:08 PM EST

time to check my hotmail account.

VoxLobster
I was raised by a cup of coffee! -- Homsar

I don't think machine learning is that hopeless (4.28 / 7) (#9)
by Delirium on Sun Mar 09, 2003 at 04:51:45 PM EST

It's true that a malicious human can trick Bayesian filtering, but Bayesian filtering is one of the simplest types of machine learning algorithms. It's basically a naive statistical classifier that operates on fairly low-level entities (usually words in the implementations I've seen) and doesn't have much in the way of ability to generalize higher-order concepts (like any sort of meaningful ability to discover English semantics). Moving towards more computationally-intensive and trickier-to-train but more powerful systems (any of the variety of well-studied neural network architectures, for example) may improve things in this regard, but I'd consider that to be at least a few years off.

In any case, my point was that I don't think it's at all true that there will ever come a time when in principle it is impossible for anyone but a human to tell spam from non-spam.

Possibly not hopeless, but hard (4.00 / 1) (#11)
by pw201 on Sun Mar 09, 2003 at 06:03:19 PM EST

I've a vague memory that Bayesian learning based on phrases actually does worse than just using words alone, although I can't remember where I read this, so don't take my word for it.

There is currently a strain of spam which tries to look like misaddressed mail and happens to mention this great site, but it's so poorly done that it's very unconvincing (to me, at least). However, we can expect the spammers who don't go to the wall to get better at this.

The article from Jeremy Bowers gives this example of "spam of the future":

Subject: Re: Re: the proposal

That's a nice point, but I think you should consider the information at http://www.somewebsite.com/info.html before going with that approach. I found that information to be really pertinent.

It's difficult to see how anything short of true AI can distinguish this from legitimate email on the basis of content alone. The debate is about whether this is where the spammers actually want to end up. Tim Peters, Python guru and one of the spambayes guys, pointed out that spam which is so bland as to be indistinguishable from ordinary mail isn't very good advertising, since most advertising is designed to stand out. We could talk about things like teaser campaigns or the rumours of people being paid to contribute subtle advertising to discussion groups and so on, but it's clear that there's some advantage to advertising being fairly direct. It'll certainly be interesting to see how the spammers do respond to advances in filtering: I'm re-reading The Blind Watchmaker at the moment, and the parts about birds' tails and arms races seem quite relevant here.



[ Parent ]

Bayesian learning isn't that hard (5.00 / 3) (#13)
by lakeland on Sun Mar 09, 2003 at 06:55:33 PM EST

Ok, writing a bayesian filter that will successfully recognise extremely tricky spam while ignoring dubious-sounding valid email is hard. But you have missed one of the most valuable features of bayesian filtering: everybody's filter is different. A spammer trying to craft a valid-sounding email may well get through some filters whose owners get similar email, but most filters will continue to block them. Currently it is economically viable to send a lot of spam because a small percentage is answered. If more effort is needed to craft every spam, and even then it is usually discarded before being read, then the economics of spam stops looking so good. Besides, even the most convincing spam is caught by distributed checksumming (Razor, DCC, etc). All that is needed is to integrate that and bayesian filtering and nothing can get through for long.

Oh, and FWIW, the reason you would think phrase-based bayes is worse is sample size. Say there are 10k words; Zipf says you will see the most common word 10k times before you see the 10,000th word once. So the co-occurrence count of two rare words is not 10k^2, but 10k^4. I.e. you'd have had to have seen many, many billions of emails to have seen two rare words co-occur at all. Basically the phrase approach ends up using the 'unknown likelihood' all the time. Of course, smoothing would help, but remember that the current bayesian filters are just proof-of-concept.

[ Parent ]
Personalisation (none / 0) (#47)
by pw201 on Tue Mar 11, 2003 at 08:06:21 AM EST

I don't believe personalisation will defeat messages which are valid English but contain neither strong spam indicators nor strong "ham" (not-spam) indicators. The article I linked to from jerf.org links to another posting which explains why.

[ Parent ]
"true AI" (none / 0) (#17)
by Delirium on Sun Mar 09, 2003 at 07:53:32 PM EST

I suppose that was my point -- "true AI" is not impossible, and is already becoming a reality in many limited domains (and I'd argue spam-detection is a suitably limited domain, though a bit more open-ended than say playing Chess).

[ Parent ]
That's okay... (4.50 / 2) (#22)
by bdesham on Sun Mar 09, 2003 at 09:56:32 PM EST

[Bayesian filtering] doesn't have much in the way of ability to generalize higher-order concepts (like any sort of meaningful ability to discover English semantics).

...most of the spam that I get doesn't have much in the way of meaningful English semantics, either :-)

--
"Scattered showers my ass." -- Noah
[ Parent ]
Computer spam detection (5.00 / 1) (#27)
by Jerf on Sun Mar 09, 2003 at 10:50:22 PM EST

It's not possible to claim with certainty that a computer will never distinguish spam, because it's not possible to say with certainty that computers will always remain dumber than humans.

However, should a true Natural Language Processing technology develop, the least of its applications will be true spam filtering. There are several "magic" technologies that, if they exist, mean all bets are off in terms of computers' capabilities, and I try to remember to disclaim that as often as possible ;-) My claim that computers can't beat humans is predicated on there being no breakthrough in AI. . . it's a little circular, but a lot of people don't realize that it's true, so it's worth saying.

Oh, and there's a reason we're using Bayesian: It works best under these circumstances. (Current) Neural nets do not work well on text, for two big reasons: One, there's no good way to represent arbitrary text for a neural network. An input for every possible word? Some encoding that drops out tons of potentially important information and characterizes all emails by 20 numbers? There's no clean way to do it, which translates into poor behavior because there's no clear way to develop.

Two, neural nets are slow, slow, slow. Bayesian filtering gives decent results extremely quickly, and the refining process works well; fewer than 50 classified messages, with a good mix, and the filter (on current spam) is well into the 90% range of accuracy. A neural net approach would require (assuming a non-existent good encoding!) hundreds or thousands of properly-classified messages, many passes to train the net, and no guarantee of success, even if a good solution exists, because neural nets start randomly and the variation caused by differing initial conditions is astonishing.

Neural nets are often used in conjunction with other techniques, but in this case I don't see any (obvious!) way they could amplify Bayes' advantages. (Maybe somebody else does, of course. ;-) )

Every other technology lumped into "machine learning" I can think of is similarly slow; Bayes is a bit of a fluke in a way in that it can give such good results so quickly. "Slow" here not in computational terms, but in terms of requiring the user to classify "hundreds" or "thousands" of messages before giving good results, which is slow in human terms; the average user will stop well before the effort pays off. Part of that is a side-effect of the fact that spam and non-spam are so different right now. Bayesian is just really good in this application, especially with the ability for the human to correct the algorithm in real-time fairly easily (one click per message in the mozilla implementation, for instance; excellent convenience. Integrating with the mail app is definitely the way to go!).

[ Parent ]

neural nets, etc. (4.66 / 3) (#28)
by Delirium on Mon Mar 10, 2003 at 12:03:27 AM EST

It's not possible to claim with certainty that a computer will never distinguish spam, because it's not possible to say with certainty that computers will always remain dumber than humans.
Well, yes, that's kind of what I meant -- that it's not in principle possible to claim that computers will remain dumber than humans. However, to be "good enough" (i.e. darn near close to 100% correct classification) I don't think requires a computer with fully human-level (or better) intelligence. I do think there's room for improvement past Bayesian filtering, even with current technology.
Oh, and there's a reason we're using Bayesian: It works best under these circumstances. (Current) Neural nets do not work well on text, for two big reasons: One, there's no good way to represent arbitrary text for a neural network. An input for every possible word? Some encoding that drops out tons of potentially important information and characterizes all emails by 20 numbers? There's no clean way to do it, which translates into poor behavior because there's no clear way to develop.
Yeah, this is always a problem with neural nets. I've done a bit of work on using neural nets to classify texts by authorship, and in searches of the literature found quite a few approaches to representing text in neural nets. One is using an encoding for every possible word (the simplest would be 0-26 for A-Z and then each word is a string of arbitrarily long integers) which is then compressed to a fixed-size encoding using a recursive auto-associative memory (RAAM) network. Most of the RAAM work has been on linguistic transformations (i.e. turning active-voice sentences to passive voice) since the compressed representations show a promising ability to be operated on directly (i.e. if a' is the compressed version of a, you can do a -> a' -> b' -> b). There are some simpler approaches, including using a fixed representation of some corpus of words (perhaps the 1000 most common words) and dropping words per se entirely and representing them by grammatical tokens (i.e. "verb", "noun", etc.). In any case, I don't think the representational problem is completely intractable.
Two, neural nets are slow, slow, slow. Bayesian filtering gives decent results extremely quickly, and the refining process works well; fewer than 50 classified messages, with a good mix, and the filter (on current spam) is well into the 90% range of accuracy. A neural net approach would require (assuming a non-existent good encoding!) hundreds or thousands of properly-classified messages, many passes to train the net, and no guarantee of success, even if a good solution exists, because neural nets start randomly and the variation caused by differing initial conditions is astonishing.
Yes, neural nets would not be a good candidate for on-line training. I had envisioned something more along the lines of what SpamAssassin does -- the network is trained on a gigantic corpus of training data by the authors, using an arbitrary amount of computer time, and then the results are distributed to the users, who use the network but do not train it. This method of training-by-the-authors seems to work pretty well with SpamAssassin, so I see no reason it can't in principle also work with neural net approaches.

As for dependence on initial conditions, that's due to backprop networks being basically local hill-climbing searches, and there are a number of approaches to minimizing local-minimum problems in hill-climbing searches (random restart, momentum and learning rate terms to get something akin to simulated annealing, etc.).

Every other technology lumped into "machine learning" I can think of is similarly slow; Bayes is a bit of a fluke in a way in that it can give such good results so quickly. "Slow" here not in computational terms, but in terms of requiring the user to classify "hundreds" or "thousands" of messages before giving good results, which is slow in human terms; the average user will stop well before the effort pays off. Part of that is a side-effect of the fact that spam and non-spam are so different right now. Bayesian is just really good in this application, especially with the ability for the human to correct the algorithm in real-time fairly easily (one click per message in the mozilla implementation, for instance; excellent convenience. Integrating with the mail app is definitely the way to go!).
If you're going to require online training by each user, then Bayesian filtering is indeed probably the best approach (or at least I don't know of any better ones). This approach may reduce false positives by allowing each user's custom-tuned filters to recognize the sorts of non-spam messages they regularly get fairly easily. I'm not entirely sure a centrally-tuned one would necessarily be worse though (perhaps the disadvantages of not being customized per-user would be made up for by a much more powerful architecture and larger training corpus; then again, perhaps not).

Anyway, the point of my post wasn't so much that Bayesian filtering sucks and that there are better approaches, just that there are, at least in principle, better approaches, and in the future some of them will likely pan out. Meaning that Bayesian filtering may be the best for now, but if it turns out to be inadequate, that may not necessarily be the end of the line for automated spam filtering. And there is somewhat of a perception in the AI community at the moment that Bayesian filters are just being applied to every problem at hand because they're the trendy approach and easy to code up (of course, the same has been said of neural nets...).

[ Parent ]

Have you tried support vector machines? (nt) (5.00 / 1) (#35)
by Magneto on Mon Mar 10, 2003 at 09:04:37 AM EST



[ Parent ]
I have to take exception to that ;-) (5.00 / 2) (#31)
by martingale on Mon Mar 10, 2003 at 02:08:20 AM EST

I found it hard to rate your post, because I think you've got some good points interspersed with some misconceptions (YMMV :-).

You're absolutely right that current "Bayesian" filters are based, when done correctly (a big if in OSS), on naive IID type statistical models. So there's room for improvement. However, your belief in AI without Bayes is misplaced.

All AI methods such as neural networks, "naive Bayesian" machines etc. come down to a decision procedure. Input on one side, decision(output) on the other. Cox(*) showed fifty years ago that the *only* consistent way of transforming input probabilities into output probabilities is via Bayes' rule. Any other rule is necessarily inconsistent.

So if you're looking for a future AI decision making apparatus, unless it's Bayesian, it just won't be consistent. Given the current obsession people have with losing legitimate email, for example, that's already a big problem. The other problem is that if the machine gets to take sufficiently many decisions per second, it's arguable that any inconsistency is bound to show up in some percentage of decisions. Bad.

So we're stuck with Bayes' rule, whether we like it or not (personally, I like it ;-). That's not to say that improvements aren't possible. The current crop of Bayesian filters, when done correctly, typically uses an IID model on words. Spambayes does an ad-hoc thing which is half way between Spamassassin and naive Bayesian, while popfile is properly naive Bayesian as far as I can tell. Then there's lots of incorrect naive Bayesian filters, like the Graham original.

I consider pretty much all AI based on networks and such as incorrect, but of course to check that requires convoluted mathematical arguments and is usually impossible. Some simple neural nets have Bayesian interpretations, which make them provably "correct".

The future of Bayesian type filters must therefore lie, if we want correctness, in more sophisticated statistical assumptions and feature spaces. Things like ngram based markovian models, and phrase based random fields are a start.

(*) I was going to fish out the reference, but here's a web link (haven't read it, so it may not be great quality).

[ Parent ]

I don't get it (2.50 / 2) (#10)
by Matt Oneiros on Sun Mar 09, 2003 at 04:57:19 PM EST

why don't I get spam? I regularly communicate through email, I sign up for many things with my real address.

Yet I don't get spam. I don't know why, everyone else says they do. I feel just a bit left out.

Does attbi filter?

Lobstery is not real
signed the cow
when stating that life is merely an illusion
and that what you love is all that's real

Same for me (none / 0) (#12)
by ComradeFork on Sun Mar 09, 2003 at 06:17:59 PM EST

I have had a certain address as my main e-mail for years. All this time, I have gotten no spam (except for one sent to the mailing list I was on). People whinge, but personally I don't have the problem at all.

[ Parent ]
For me it was Usenet (none / 0) (#15)
by norge on Sun Mar 09, 2003 at 07:21:17 PM EST

I naively posted to Usenet with my real address when I first started using it years ago, and the spam hasn't stopped since.

Benjamin


[ Parent ]

the devil usenet (none / 0) (#19)
by Matt Oneiros on Sun Mar 09, 2003 at 08:14:40 PM EST

I still post to it these days, put a hotmail account as a reply address though.

Of course I discovered the value of that back when USwest was still around.

Come to think of it, I learnt of the powers of spam via usenet back when USwest was still my dialup provider. Then they turned into Qwest, who turned dialup over to MSN, so I turned to attbi because I got tired of phone companies. Now attbi is going to offer digital phone service; I may just go for it...

Lobstery is not real
signed the cow
when stating that life is merely an illusion
and that what you love is all that's real
[ Parent ]

Bingo. (none / 0) (#21)
by porkchop_d_clown on Sun Mar 09, 2003 at 09:46:44 PM EST

I get upwards of 50 spams per day.


--
You can lead a horse to water, but you can't make him go off the high dive.


[ Parent ]
How to change that. (5.00 / 1) (#20)
by NFW on Sun Mar 09, 2003 at 09:41:08 PM EST

Very bad ideas:

Put your email address on a web page that gets a lot of traffic. Some of the traffic will be spammers' address-harvesting tools, and then it's all downhill.

Post to newsgroups using your address.

Moderately bad ideas:

Have your address in your IRC /whois data, and use big public IRC servers and/or popular chat rooms.

Keep putting your address into forms on the WWW.

This is not a complete list. :-)


--
Got birds?


[ Parent ]

my favourite (none / 0) (#23)
by Matt Oneiros on Sun Mar 09, 2003 at 10:26:06 PM EST

the "free teens in your email" type boxes on porn sites.

I did that to a guy I know once. He got pissed.

Lobstery is not real
signed the cow
when stating that life is merely an illusion
and that what you love is all that's real
[ Parent ]

Recommended Windows software (none / 0) (#14)
by edo on Sun Mar 09, 2003 at 07:09:51 PM EST

For Windows users on dial-up who collect their e-mail via POP3 (still plenty of those around, myself included...) I recommend the handy freeware application MailWasher.

It's basically a tool for viewing headers and previewing and/or deleting messages while they are still on the other end of your phone line. Very useful indeed. Beats telnetting into the server...
-- 
Sentimentality is merely the Bank Holiday of cynicism.
 - Oscar Wilde

plussed addresses (4.00 / 1) (#16)
by durkie on Sun Mar 09, 2003 at 07:49:34 PM EST

spam is a rarity for me. i borrowed a trick from my friend tris of using timestamps and plussed addresses, so that every email i send has a unique address and i can pinpoint precisely where a spammer got my address from. i have mutt set up so that it takes the output of `date +%s` (say 1047256889), converts it to base 36 to make it nice and compact (hbiclf), and then appends it to my address (user+hbiclf@domain.com). plussed addresses have always scared me in that the spammer might figure out the real username and then send mail there, but that hasn't ever been the case in the three-plus years i've been doing this. mail is occasionally sent to an address like hbiclf@domain.com, but never to just the username.
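for anyone wanting to replicate the trick outside mutt, here's a sketch in python (the base-36 alphabet and address shape are my assumptions mirroring the comment; the exact string mutt + `date` produces may differ):

```python
import time

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Convert a non-negative integer to a compact base-36 string."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))

def plussed_address(user, domain, ts=None):
    """Build a per-message plussed address, e.g. user+<base36 timestamp>@domain."""
    ts = int(time.time()) if ts is None else ts
    return f"{user}+{to_base36(ts)}@{domain}"

print(plussed_address("user", "domain.com"))  # unique address for this moment
```

each outgoing message gets a distinct local part, so the tag in any spam you later receive tells you which message (and hence which recipient or site) leaked the address.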

Tagging (4.00 / 1) (#18)
by rf0 on Sun Mar 09, 2003 at 08:14:00 PM EST

One trick that I have learnt is to tag my email addresses. This works as I have catch-all email accounts. By this I mean that *@domain.com -> localuser. So say I go sign up for a mailing list at wibble.com, I use an address of wibble.com@domain.com.

That means that I can a) filter the address and b) if I start getting spam on that address I know who sold it. Also when posting on usenet I always use usenet@domain.com and automatically /dev/null it.

It might not cut down on spam, but at least you can track where it comes from.

--
a2b2.com - Stable, Friendly Decent Hosting

I do much the same... (none / 0) (#25)
by Kaki Nix Sain on Sun Mar 09, 2003 at 10:31:49 PM EST

... with sneakemail.

For a while, a couple of years ago, I was up to about 17 spams/day. These days I get maybe one spam every few days. I sometimes wonder what all the fuss is about. :)



[ Parent ]

More Bayes attack stuff (5.00 / 2) (#24)
by Jerf on Sun Mar 09, 2003 at 10:27:43 PM EST

I posted a report on my hypothetical Bayes attack a while ago; I really should internally link all those posts together.

Upshot: It was harder than I expected, but on the other hand, I didn't cheat at all; spammers will.

Also, not posted (at least not as clearly as I'd like) is that the experience did back up my expectations of how the Bayes filter will work; there is a strong core of "goodness" that I expect will come out regardless of how "personalized" the filter is. The personalization signature is small and swamped by perfectly normal differences between spam and non-spam; it might make or break a message that is on the edge, but under normal circumstances, you'd probably have to hand-tweak the filter to grossly disproportionately "like" your personalization words.

And generally you can accomplish the behavior you're looking for through whitelists, which regardless of the spam fighting technique are a good backup anyhow, as they will go a long way towards mitigating the false positive problem.



pretty neat (4.00 / 1) (#32)
by martingale on Mon Mar 10, 2003 at 02:21:47 AM EST

Your experiment is pretty cool. It'd be good if you put down the actual equations you used, though.

Having said that, I remain unconvinced by the handwaving on your web page (yes, I realize that if I say that, I'm not your target audience). Without the equations, I'm not sure I believe you have much hope of tackling Bayesian models which are more sophisticated than IID word models. One of the problems with ngram models is the exponential increase in the number of tokens. If I understand what you're trying to do, you're looking to string together (manually, at present) a message from low or average probability tokens, while keeping a semblance of meaning.

That's relatively easy in an IID word model, but it's going to be hopeless for say a trigram model, I think.

[ Parent ]

Trigram easier or harder? (none / 0) (#42)
by Jerf on Mon Mar 10, 2003 at 09:47:41 PM EST

Bi- or tri-grams may be easier to use, even though they'd be a little harder to program (not a stopper), with a good interface, because of the greater variety of choices. That's what the "markov-style" probability chains I mentioned in one of the messages are for; as you're typing the message, "SpamWriter2" would show you which next word you could use based on the trained spams.

(Initially I was going to wait until SpamBayes started using n-grams, but the timing didn't work out. And I wanted to use SpamBayes because it was in Python, which for me would be one of the easiest to program.)

I emphasize "may" because strictly speaking, I don't know what would happen in an n-gram environment, because I haven't tried it. I know I can extend the attack to handle it (after all, the original proposal was based on an n-gram solution, and what I did was really a special case for 1-grams), but until someone tries it, we don't really know whether it would work, or be impossibly hard (or easy!).

As for the actual equations, I used the SpamBayes implementation of Gary's algorithm and chi-squared. Last I knew, SpamBayes was only available via CVS, but the documentation in classifier.py (or classify.py) is quite clear on how it works; Tim Peters, one of the main authors on that project, is an excellent numerical programmer by specialty.

Unfortunately, I just realized about an hour ago, when my laptop died, that I lost all the code for this because I didn't see fit to back it up. So I can't give you the exact time I pulled from the CVS. However, to justify myself a bit, part of my point with the "common core" of goodness stuff is that regardless of the exact variant, if you train on enough hams and spams you should be able to replicate what I did. So in a sense, the exact equations don't matter much. I do regret losing the code, though, because it certainly does detract from the scientific validity.

Though truthfully, science or not, I don't think I could ever have brought myself to release the code. I think I would have cried the first time I received a spam that obviously used a variant of my code to write it. (At least now I have plausible deniability for myself; some spammer may have replicated my ideas without necessarily reading my stuff, and I know they don't have my code. ;-) ) Kind of an odd position to be in, science-wise. The conundrum in this case is even worse than deciding whether to release details on a security bug!

So, fair point in that the details are a little lacking from a scientific perspective. But you're also right that I'm trying to convince others, even non-scientists, that this isn't going to end spam; we really need to be working on non-filtering based avenues.

[ Parent ]

I think it'd be harder (none / 0) (#44)
by martingale on Mon Mar 10, 2003 at 11:58:10 PM EST

but without data, I'm just wildly speculating.

Here's how I'm thinking about it (nonmathematically, of course): Let's say you have an original message O. Your method is supposed to transmogrify O into a new message FO (Fake Original), which has a different probability under H, the "Ham" model.

Let's say H is an IID word model (I haven't looked at the chi^2/Gary stuff yet, but similar reasoning would possibly apply). The log probability of O under H is, to within a constant, the sum of the log probabilities of each word w, over all words w in O. After transformation, the log probability of FO under H is, to within the same constant, the sum of the log probabilities of each transmogrified word Fw. I'm assuming you're simply changing each word in the message, for simplicity, rather than changing whole chunks of O at once.

So what your program does is help you to map each word w to a word Fw, so that the probability of Fw under H is higher than the probability of w, say. In reality, what you want is that the sum over all w in O of the log probabilities of Fw under H be higher than the same sum over the log probabilities of w under H, but let's keep it simple.

Now let's look at your experiment in this simplified setting. You took a couple of hours (with no prior practice) to find, for each w in O, a synonym Fw whose probability under H is higher than that of w under H. With practice, you could probably be faster. You could probably even write a program to solve this simplified problem.

What's interesting here is that the subproblems (replacing w with Fw) are independent, which makes it, relatively speaking, an easy problem.

Now say you have a trigram model TH. The log probability of O under TH is also a sum, (up to a constant) but it's a sum over all consecutive triples of words. Solving the transmogrification problem would entail finding a transformation TF such that TF(w1,w2,w3) has higher probability under TH than (w1,w2,w3) has, for all consecutive word triples in O.

But now there's a difference. Whereas before you could find F independently for each token w, here you need to find TF such that simultaneously TF(w0,w1,w2) improves (w0,w1,w2), and TF(w1,w2,w3) improves (w1,w2,w3), and TF(w2,w3,w4) improves (w2,w3,w4). All right, you only need to improve the sum of the final three log probabilities compared to the sum of the original three, but let's keep things simple.

I claim it's an intrinsically harder problem, because each time there are more constraints. Remember, the final text FO is supposed to make sense and, broadly speaking, have the same meaning as the original O. What you would need to do is find a trigram synonym for the first trigram, then a trigram synonym for the second trigram under the constraint that the first two words in the second trigram are already given (by the last two you chose as a synonym for the previous trigram), and so on.

In other words, if you first solve for (y0,y1,y2) such that (y0,y1,y2) = TF(w0,w1,w2), then you must next solve for y3 such that (y1,y2,y3) = TF(w1,w2,w3) and then for y4 such that (y2,y3,y4) = TF(w2,w3,w4) and so on.

I'm not sure English has enough degrees of freedom to do this kind of thing.

Now admittedly, I've stated these things in terms of an IID naive Bayes algorithm, and there'll be differences in the chi^2 stuff by SpamBayes. I've stated things deliberately in detail in case I'm completely missing what it is you're actually doing.

What else? Bummer about your code, but it's probably for the best. For what it's worth, I wasn't too interested in seeing your actual program, rather I wanted to see the mathematical analysis it is based on.

Actually, it's probably a good thing about your code, because I think it would have been particularly "helpful" against the plethora of Graham inspired algorithms, which only look at five or ten of the most surprising tokens in the whole message. In that case, the problem reduces, for each message, to essentially finding synonyms for five or ten independent words - trivial compared to what I'm suggesting above.
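The difference in constraints can be made concrete by counting which score terms a single word substitution touches (a toy sketch; a real model would attach log probabilities to each term):

```python
def unigram_terms(words):
    # IID word model: one independent score term per word.
    return [(w,) for w in words]

def trigram_terms(words):
    # Trigram model: one term per consecutive triple; the windows overlap.
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

msg = "buy cheap pills online now".split()

# Substituting the middle word changes exactly one unigram term...
changed_uni = [t for t in unigram_terms(msg) if "pills" in t]
# ...but three overlapping trigram terms, each of which must still be improved.
changed_tri = [t for t in trigram_terms(msg) if "pills" in t]

print(len(changed_uni), len(changed_tri))  # 1 3
```

That coupling is the heart of the argument above: under an IID model each replacement is an independent subproblem, while under a trigram model every replacement constrains its neighbours.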

[ Parent ]

IRTF ASRG Observations (5.00 / 2) (#26)
by NFW on Sun Mar 09, 2003 at 10:41:09 PM EST

I've been on the Internet Engineering Task Force (IETF) Anti-Spam Research Group (ASRG) mailing list for a few days now.

Every time someone proposes a system that solves one problem (e.g. causes users to spend way less time dealing with spam), the system gets shot down, often as rudely and eloquently as possible, because it comes at a cost (e.g. increases the load on the infrastructure), or doesn't fix the problem for everyone all at once, or works for private individuals but not commercial users, etc, etc.

It's painful to watch.

If spam bothers you, don't wait for an official solution. That could take a while (understatement). Grab something like SpamAssassin, TMDA, ASK, etc., and have at it.


--
Got birds?


Spam filtering (1.00 / 2) (#29)
by 187 on Mon Mar 10, 2003 at 12:45:36 AM EST

Spam filtering is very important, but not as important as getting a blue diamond rolex for $100.

187

Oh, the irony (none / 0) (#33)
by carbon on Mon Mar 10, 2003 at 02:49:29 AM EST

A spam message to an article on spam.


Wasn't Dr. Claus the bad guy on Inspector Gadget? - dirvish
[ Parent ]
Stop most of it at the firewall. (none / 0) (#30)
by libertine on Mon Mar 10, 2003 at 01:59:25 AM EST

A friend of mine made a perl script that will create firewall rules based on the SPEWS list and add them to a firewall conf file.

Address is at Madripoor.org


"Live for lust. Lust for life."

What I would like to see, a variant of DCC/Razor. (4.20 / 5) (#34)
by arcade on Mon Mar 10, 2003 at 04:32:29 AM EST

I've long been speculating on how one could make an almost perfect spam filter. I did come up with an idea, but then Vipul's Razor came out, which seemed to do what I thought of. But, not quite.

What I want is a distributed network, kinda like Gnutella, to distribute checksums of spam.

Some will spot the immediate problem: malicious reporting of mailing list messages, say bugtraq. However, I think this can be solved quite easily.

Each checksum would need to be signed using, say, GnuPG. The packet would need to contain a version number, key-id, a checksum and a signature. The key-id would then be used to retrieve the public key from a keyserver.

The interesting thing here is that one could assign 'trust' to various signatures. It could be done automagically. If one sees a single false report from a certain signature, every report from that particular key would be ignored in the future.

To avoid flooding of this distributed network, the protocol could require that each spamreporting host only report each hash once.

Lots of other ideas surely could be implemented too. The bottom line is that it should be possible to implement a network of spamtraps that reports to this distributed network, and with enough spamtraps almost no spam would go unreported. Each spamrun would be reported within minutes of starting, and those filtering could just drop the spam.
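A toy model of this report-and-revoke scheme might look like the following (class names and structure are hypothetical; a real node would verify GnuPG signatures against a keyserver rather than trusting bare key IDs):

```python
import hashlib

class ChecksumNetwork:
    """Toy node in the proposed network: signed spam checksums, revocable trust."""

    def __init__(self):
        self.trusted = {}   # key_id -> still trusted?
        self.reports = {}   # message checksum -> set of reporting key_ids

    def report(self, key_id, body):
        # A real implementation would verify the GnuPG signature here;
        # this sketch just records the report under the key's ID.
        if not self.trusted.setdefault(key_id, True):
            return  # reports from distrusted keys are ignored
        digest = hashlib.sha256(body.encode()).hexdigest()
        self.reports.setdefault(digest, set()).add(key_id)

    def mark_false_report(self, key_id):
        # A single confirmed false report and the key is ignored forever.
        self.trusted[key_id] = False

    def is_spam(self, body):
        digest = hashlib.sha256(body.encode()).hexdigest()
        return any(self.trusted.get(k) for k in self.reports.get(digest, ()))

net = ChecksumNetwork()
net.report("keyA", "MAKE MONEY FAST")
print(net.is_spam("MAKE MONEY FAST"))  # True
net.mark_false_report("keyA")
print(net.is_spam("MAKE MONEY FAST"))  # False
```

As the replies below note, the hard part is not this bookkeeping but deciding, in a way spammers can't abuse, *when* a report counts as false.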



--
arcade
Uhm, (5.00 / 2) (#36)
by chiquitita on Mon Mar 10, 2003 at 09:47:46 AM EST

How does one 'see a false positive'? If you manually see it then you can stop trusting that individual, but that isn't enough to prevent the network from being seriously abused.

If you can cause other people to stop trusting a key, then how do you keep the spammers from falsely claiming all of the keys on the key-server are not to be trusted?

You don't have an adequate solution to revoking trust.

-x3n0 (via gf account)


[ Parent ]

weighing. (none / 0) (#46)
by arcade on Tue Mar 11, 2003 at 05:05:43 AM EST

You would have to weigh the one who claims a misreport against the other one. Of course this couldn't be done automagically ALL the time, but it should be possible to create a quite good 'web of trust' with a minimum of fuss. :)



--
arcade
[ Parent ]
Close, but you want to assert goodness, not badness (none / 0) (#43)
by Jerf on Mon Mar 10, 2003 at 09:55:04 PM EST

Close, but instead of trying to identify spams, identify good senders, as long as you're using GnuPG. You can also then set up webs of trust; if you trust A, and A trusts B, you trust B some amount.

This is probably the best solution in the long run, and the one that involves the least centralized control. It's just that spam has to become a big enough problem that it's worth doing. Perhaps we'll never quite be that collectively irritated.

(The web of trust would be useful in other ways, though.)

[ Parent ]

That was sort of what I intended. (none / 0) (#45)
by arcade on Tue Mar 11, 2003 at 05:04:01 AM EST

Sorry if it didn't come through clearly, but that was one of the things I had in mind. Webs of trust, _in addition to_ removal of trust if there is misreporting. No misreporting should ever be allowed.



--
arcade
[ Parent ]
De-legitimize HTML in email (4.00 / 1) (#37)
by IHCOYC on Mon Mar 10, 2003 at 10:41:48 AM EST

I scan all my email for the presence of HTML tags and MIME attachments; emails containing them are shunted to a special folder, which I used to open only offline. (Now that I have switched to Mozilla, there is no longer any need to do so, since I can disable scripts and external images in email without disabling them generally.)

Legit emails do not need HTML formatting. Removing the HTML emails gets rid of at least 90% of the spam I get, though it does flag mails from AOLers as false positives. If I am expecting to hear from you, I can move mails from you to a legit inbox. Getting rid of HTML emails lets some spams through, mostly Third World Dictator scams, but gets rid of most of the spam.

Whoever thought it would be a good idea to include HTML in email was an idiot in the first place. There is absofuckinglutely no reason why anyone should wish to decorate their email with that kind of crap. Ideally, I'd want a mail application that simply removed HTML formatting and codes in email, and displayed them all as raw text.
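A rough version of this shunting rule can be sketched with Python's standard email module (a sketch only; actually moving flagged messages to a folder is left to your delivery setup):

```python
import email

def looks_like_html_mail(raw):
    """Return True if the message carries an HTML part or a MIME attachment."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return True
        if part.get_filename():  # part declares a filename => an attachment
            return True
    return False

plain = "From: a@b.c\nSubject: hi\n\nLunch tomorrow?\n"
html = ("From: x@y.z\nSubject: $$$\n"
        "Content-Type: text/html\n\n<html><b>BUY NOW</b></html>\n")

print(looks_like_html_mail(plain))  # False
print(looks_like_html_mail(html))   # True
```

Anything the function flags would go to the quarantine folder described above; the AOL false-positive problem is exactly the "legit senders whose mailers default to HTML" case raised in the replies.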
 --
The color is black, the material is leather, the seduction is beauty, the justification is honesty, the aim is ecstasy, the fantasy is death.

pine (none / 0) (#38)
by derepi on Mon Mar 10, 2003 at 11:28:08 AM EST

Ideally, I'd want a mail application that simply removed HTML formatting and codes in email, and displayed them all as raw text.

It's called pine. Use it every day. It's even got its own filters so that you can dump all html mail to its own folder and never look at it, if you're so inclined.

[ Parent ]

You must get a lot of false positives (4.00 / 1) (#40)
by gasull on Mon Mar 10, 2003 at 04:30:14 PM EST

A lot of people send HTML email because their mailers send it by default, which is contrary to netiquette on some mailing lists. What we need is for mailers to send plain-text email by default.

[ Parent ]
Recognising spam by its bulkiness - part 2 (4.00 / 1) (#39)
by bsimon on Mon Mar 10, 2003 at 11:30:37 AM EST

If content-based filtering (eg Bayesian) is really doomed as jerf suggests, what other options are there?

How about... if e-mail clients automatically log the IP address of the sender of each piece of mail they receive - and pass that IP address over the Internet to a central database.

The more e-mail the database spots coming from a single IP, the stronger the probability that it is spam. Then clients can use that probability to decide whether to move or delete suspect mails, or they can feed it into a content-based (eg bayesian) filter to improve the filter's accuracy.

AFAIK, it's impossible to conceal the IP address of the mail server (or the relay, if one is used) - and if I'm wrong about that, then this system won't work...

This method doesn't require access to the content of the mail, and it doesn't care if the spammer is using an open relay - it'll simply block the relay, making it useless. And, given sufficient clients, it works in real time, detecting and deleting waves of spam almost as fast as they can be sent out.
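The counting idea can be sketched as follows (class name and threshold are made up for illustration; a real deployment would need decay over time and the whitelist discussed below):

```python
from collections import Counter

class IPVolumeDB:
    """Toy central database: the more mail seen from an IP, the more spammy."""

    def __init__(self, threshold=100):
        self.counts = Counter()
        self.threshold = threshold  # volume at which an IP is certainly bulk

    def report(self, ip):
        # Each client calls this for every message it receives.
        self.counts[ip] += 1

    def spam_probability(self, ip):
        # Crude mapping from observed volume to a spam probability.
        return min(1.0, self.counts[ip] / self.threshold)

db = IPVolumeDB(threshold=100)
for _ in range(250):
    db.report("198.51.100.1")   # a bulk mailer
db.report("203.0.113.5")        # a normal sender

print(db.spam_probability("198.51.100.1"))  # 1.0
print(db.spam_probability("203.0.113.5"))   # 0.01
```

Clients could then feed that probability into their local (e.g. Bayesian) filter, as the comment suggests.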

Possible weaknesses:

  1. This system becomes more effective the more people use it - if only a small number adopt it, then it won't catch very much spam and it will be slow to detect spamming IPs. And it works best if the software is always online, checking for mail frequently, so bulk mailings can be detected and blocked immediately.
  2. It relies totally on the fact that usable IP addresses are expensive (time/money) for spammers - because the spamming IPs rapidly get shut down or blacklisted and the spammers have to move on.
  3. It would require a whitelist of legitimate sources of large amounts of e-mail, like hotmail.
  4. People might be concerned about privacy, with all those IPs going to a central database.
  5. It would need a secure way of identifying and communicating with clients - to prevent it being flooded with false data
Actually, this all seems rather obvious; I assume there must be software out there that can already do this... anyone know?

you have read my sig

Problem determining sender IP (5.00 / 1) (#48)
by pw201 on Tue Mar 11, 2003 at 08:14:28 AM EST

The problem with this is determining the sender IP. You can only really trust the IP address recorded by machines which you know to be well run, which must initially just be your own machines (or the mail exchanger machines of your ISP if you're not handling mail for yourself). While it may be possible to decide to trust other servers to record IPs correctly, that'll be a slow process. In the meantime, you'll end up finding that the outbound mail servers of big ISPs are the ones with the highest counts.

You could take another approach and decide that you'll scan all the Received headers for IPs and increment all their counts by one, but that then has the above problem and leaves the system open to attacks via spoofed headers.

[ Parent ]

Are many machines not 'well run'? (none / 0) (#49)
by bsimon on Wed Mar 12, 2003 at 07:31:28 AM EST

I've no idea what percentage of machines aren't 'well run', and therefore really can't be trusted to pass on the sender IP correctly. I assumed it was a relatively tiny number, say <1%?

If it's a small percentage, then why not just put their IPs into the database as likely 'secondary' spam sources? The system then assumes that mail which has passed through those machines is more likely to be spam.

RE: big ISPs as spam sources... I thought the outbound mail servers of big ISPs weren't such a serious problem.

Because... the ISPs can quickly detect a suspicious volume of mail coming from a single account and do something about it - forcing the spammer to open new accounts repeatedly and send a few hundred emails from each, or find an open relay. I don't know how well this works in the real world, but at least it doesn't seem like a difficult technical problem.

you have read my sig
[ Parent ]

Where the chain really starts (5.00 / 1) (#50)
by pw201 on Wed Mar 12, 2003 at 02:22:42 PM EST

The problem is working out where the chain of relays really starts. The chain of senders is contained in the Received lines of the headers. Every host which passes on the mail should add a Received line saying "Received from ip_address by fred" or something. These are prepended to the existing headers by each relay.

If I'm a spammer and I want to fool your system, I just add a fake received line with a randomly varying IP address to the message as I send it out, saying "Received from random_ip by spammer's cable modem". If your process works by just taking the last Received header as the originator, then it never accumulates a high count for any one IP, since the spammer's faked Received line is the one with the IP you're counting. Humans can usually work out faked headers, but it's hard to build a computer system which can.
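The spoofing problem can be seen in a toy header parse (IPs are from the documentation ranges, and the regex is a simplification of real Received-line syntax):

```python
import re

def received_ips(raw_headers):
    """Extract IPs from Received lines, newest first (as prepended by relays)."""
    return re.findall(r"Received: from \[?(\d+\.\d+\.\d+\.\d+)\]?", raw_headers)

# The spammer adds a fake Received line *before* sending; your own MX then
# prepends a genuine one on delivery, so the forged line sits below it.
headers = (
    "Received: from [203.0.113.7] by my-trusted-mx\n"     # added by your server
    "Received: from [198.51.100.99] by spammers-modem\n"  # forged by the spammer
)

ips = received_ips(headers)
print(ips[0])   # 203.0.113.7 - the only hop you can actually vouch for
print(ips[-1])  # 198.51.100.99 - the "originator" under a naive reading, but fake
```

A counter that naively credits the bottom-most Received line would count the spammer's randomly varying fake IPs, which is exactly why the scheme needs a trusted recording point.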

[ Parent ]

Some tools (none / 0) (#41)
by gasull on Mon Mar 10, 2003 at 04:35:21 PM EST



software (none / 0) (#51)
by grant7 on Sat Jun 14, 2003 at 12:10:40 AM EST

there are basically two types of spam filtering software: those based on specific rules (filters) and the Bayesian statistical analysis type. there are spam filters for clients and servers, this platform and that, but they break down into those two categories.

for server-side spam filtering there are a couple good Bayesian ones: SpamAssassin and DSPAM. in the rules/filters category there is Active Spam Killer - similar to BlueBottle mentioned above, it has whitelist matches to allow mail through, a confirmation reply to unknowns placed in a holding queue, and a blacklist for addresses known to send spam. the messages in your queue you can review and decide whether to deliver, whitelist/blacklist, delete, etc. by sending a message to your own address with the subject 'ask process queue'. simple system, but effective.

the only problem with the Bayesian statistical analysis approach is that when you do have a false negative (spam gets through) or worse a false positive (real mail gets filtered) it can be very difficult to tell exactly why it occurred, which I know personally is very befuddling... I like to know why!   ;-)

of course the more mail that goes through one of those systems the better its accuracy, so perhaps the DCC is worth a try.

I think what frustrates a lot of people is that there should be no way to send spam in the first place... finding out that the de facto standard process for relaying mail does not include authentication can be a disturbing thing to learn, yet it is the only way mail gets from one node to the next. Recognizing that the global network is the best place to have electronic mail (as opposed to a localized subnet) has led to the necessity of dealing with spam, security and privacy. It sure would be nice to have encryption built into mail protocols rather than software systems.

mozilla -mail (none / 0) (#52)
by frijolito on Wed Jun 18, 2003 at 07:19:51 PM EST

I'm very happy with Mozilla Mail's bayesian spam filtering. I recommend it to anyone who wants to get started with spam filtering but is lazy as f*ck like myself.

Spam filtering: a whistle-stop tour | 52 comments (46 topical, 6 editorial, 0 hidden)