Kuro5hin.org: technology and culture, from the trenches

Email filtering with Procmail + SpamAssassin + ClamAV

By mbreyno in Technology
Mon Feb 02, 2004 at 01:59:06 AM EST
Tags: Internet (all tags)

We all get a lot of crap these days in our email and it's pretty much a necessity to have some form of filtering in place. So what's the best way to protect your inbox on your *nix server? Well, here's a quick and easy way to filter spam and viruses with free software and about an hour.


On my server, I'm running SuSE Linux 8.2 and using Postfix for mail delivery.

Procmail

Procmail is a mail processing utility that allows you to sort your incoming mail based on criteria that you specify. With Procmail, you can look for certain patterns or mail header information and perform actions on that message depending on what conditions it meets.

On my server, procmail was installed by default. To check, just run:

which procmail

If it's installed, you'll see the path. If it's not installed, grab it from the SuSE RPM site or wherever you get stuff for your OS and off you go.

SpamAssassin

SpamAssassin is a mail filter used to identify spam. It uses a wide range of heuristic tests on mail headers and body text to identify spam and assigns a point value to each message. If a message gets enough points, it is deemed spam and can then be filtered accordingly.
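To make the point system concrete, here's a rough sketch of the idea in shell. It is not SpamAssassin's actual code: score_message is a made-up function, the rule names and scores are sample values, and the threshold of 5 is assumed to be the default.

```shell
#!/bin/sh
# Illustration only: add up per-rule scores and compare the total to a threshold.
# Input on stdin: one "SCORE RULE_NAME" pair per line (sample data, not real output).
# The threshold of 5 is an assumption about the default setting.
score_message() {
    threshold="${1:-5}"
    awk -v t="$threshold" '
        { total += $1 }                      # first field is the rule score
        END {
            verdict = (total >= t) ? "spam" : "ham"
            printf "%.1f %s\n", total, verdict
        }'
}

# A message that trips several low-value rules still stays under the threshold.
printf '%s\n' \
    '0.2 HTTP_WITH_EMAIL_IN_URL' \
    '0.5 HTML_MESSAGE' \
    '0.8 NO_REAL_NAME' | score_message 5
```

The point is that no single rule decides anything. Only the total matters, so one odd-looking header doesn't condemn a message.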

SpamAssassin is a pretty easy install. The instructions can be found at http://www.spamassassin.org but here are the quick and dirty instructions:

(become root)
perl -MCPAN -e shell
o conf prerequisites_policy ask
install Mail::SpamAssassin
quit


This will install the spamassassin executable and necessary scripts. Pretty easy, eh?

ClamAV

ClamAV is an open source virus scanner. It currently identifies over 20,000 viruses. The virus database is kept up to date with the help of the user community. Users can submit new viruses as they are discovered and they will quickly be added to the database.

ClamAV requires compiling (at least on my Linux distro) but it's also pretty painless. Head on over to http://www.clamav.net and download the latest. Installation instructions are on the site, but here's the cheat sheet:

(become root and enter the source directory)
groupadd clamav
useradd -g clamav -s /bin/false -c "Clam AntiVirus" clamav
./configure --sysconfdir=/etc
make
make install


Next, edit /etc/clamav.conf and comment out the word "Example".

Now, before we put it all together, we need to grab a nice little script called clamfilter.pl. You can get it at http://www.everysoft.com/clamfilter.html. Download this script and place it somewhere on your filesystem. I like to place Perl scripts in /usr/local/scripts. Kudos to Matt Hahnfeld at EverySoft for this script!

Putting It All Together

Now that everything is installed, we need to actually do some filtering. To set up filtering for your user account, first create a file in your home directory called .forward (yes, that's "dot forward") and put this line in it:

"|IFS=' ' && exec /usr/bin/procmail -f- || exit 75 #user"

If procmail is installed in a different location on your server, adjust accordingly.

Now, create a file in your home directory called .procmailrc ("dot procmailrc") and put this in it:

:0fw
| /usr/local/scripts/clamfilter.pl

:0:
* ^X-Virus-Found: yes
mail/Quarantine

:0fw: spamassassin.lock
| /usr/bin/spamassassin -a

:0:
* ^X-Spam-Status: Yes
mail/Quarantine


This runs all mail through ClamAV and SpamAssassin and if it's tagged as spam or a virus, it drops it in the folder called "Quarantine" in your mail directory. Notice that I'm invoking SpamAssassin with the -a flag so that auto-whitelisting is enabled.

Maintenance

Now we need to set up some cron jobs for maintenance. First off, let's teach SpamAssassin to be better each day. Grab a little script I wrote called sa-learn.pl and place it somewhere (like /usr/local/scripts). Edit it as necessary and run it from cron once a day. This script will go through each user's home directory and look for folders called MissedSpam and NotSpam. If a user has placed missed spam or false positives in either of these folders, SpamAssassin will learn from it.
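If you'd rather see the idea than grab the script, here's a minimal shell sketch of that kind of job. This is not sa-learn.pl itself. It assumes home directories live under /home, that MissedSpam and NotSpam are mbox files under each user's mail/ directory, and that the sa-learn command that ships with SpamAssassin is on the PATH. HOME_ROOT, SA_LEARN and learn_all are names invented here.

```shell
#!/bin/sh
# Sketch only, NOT the author's sa-learn.pl.
# Assumptions: homes under /home, folders stored as mbox files in ~/mail/.
HOME_ROOT="${HOME_ROOT:-/home}"      # hypothetical: where home directories live
SA_LEARN="${SA_LEARN:-sa-learn}"     # override to try a dry run, e.g. SA_LEARN=echo

learn_all() {
    for home in "$HOME_ROOT"/*; do
        [ -d "$home" ] || continue
        # Missed spam teaches the filter what spam looks like.
        if [ -s "$home/mail/MissedSpam" ]; then
            $SA_LEARN --spam --mbox "$home/mail/MissedSpam"
        fi
        # False positives teach it what legitimate mail looks like.
        if [ -s "$home/mail/NotSpam" ]; then
            $SA_LEARN --ham --mbox "$home/mail/NotSpam"
        fi
    done
}

# Call learn_all at the end of the script once the paths above are confirmed.
```

As written, everything is learned into the database of whichever account runs the script, so per-user databases would need extra work.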

Next, let's make sure our virus definitions are updated. Set freshclam to run once an hour. Freshclam is a program that connects to the ClamAV site and checks for database updates. The command is:

/usr/local/bin/freshclam

Here is what my crontab file looks like:

# teach spamassassin - run every morning at 3am
0 3 * * * /usr/local/scripts/sa-learn.pl >> /dev/null 2>&1

# update virus definitions every hour on the hour
0 * * * * /usr/local/bin/freshclam --quiet


Now, we have a quick, easy, and effective solution for filtering spam and viruses. Just give your .forward and .procmailrc files to any other users who want to activate email filtering. You could even write a short script to automatically copy these files into everyone's home directory. Enjoy!
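Here's one way that short script might look. It's a sketch under a few assumptions: the master copies of the two files sit in a directory of your choosing (SRC_DIR, a made-up location), home directories live under /home, and each directory name matches the login name. install_filters is a name invented for this example.

```shell
#!/bin/sh
# Sketch: copy the filtering dotfiles into every home directory that lacks them.
SRC_DIR="${SRC_DIR:-/usr/local/scripts/skel}"   # hypothetical: holds .forward and .procmailrc
HOME_ROOT="${HOME_ROOT:-/home}"                 # assumption: where home directories live

install_filters() {
    for home in "$HOME_ROOT"/*; do
        [ -d "$home" ] || continue
        for f in .forward .procmailrc; do
            # Never overwrite a file the user already has.
            [ -e "$home/$f" ] && continue
            cp "$SRC_DIR/$f" "$home/$f" || continue
            chmod 644 "$home/$f"
            # Hand the file to its owner; assumes directory name == login name.
            chown "$(basename "$home")" "$home/$f" 2>/dev/null || true
        done
    done
}

# Run as root: install_filters
```

Skipping existing files is deliberate, since some users will already have their own .procmailrc.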

Email filtering with Procmail + SpamAssassin + ClamAV | 65 comments (54 topical, 11 editorial, 2 hidden)
Procmail stuff (2.62 / 8) (#2)
by gazbo on Sat Jan 31, 2004 at 07:28:24 AM EST

OK, this article is pretty useful for me, as it's something I've been meaning to implement for a while. But, and I know I'm just being lazy by not looking for myself, there is an extra piece of functionality that would be very useful:

I have three "classes" of email address at work:

  • My personal email addresses, which are 95% spam free
  • root@various-servers, which are 50% spam, 50% cron
  • hostmaster@numerous-domains, which are 99% spam
So really, I'd like to be able to filter my email so that my personal email addresses are passed through untouched, as false positives are pretty much entirely unacceptable, but send the other two classes through the filter. In actual fact, we're probably talking about 50 or so email addresses, so wildcards would be a definite bonus.

I'm guessing it's fairly easy, but I have no knowledge of procmail.

-----
Topless, revealing, nude pics and vids of Zora Suleman! Upskirt and down blouse! Cleavage!
Hardcore ZORA SULEMAN pics!

You can do that. (3.00 / 4) (#13)
by waxmop on Sat Jan 31, 2004 at 12:42:55 PM EST

In your .procmailrc file, just do a test before you send any messages to the clamfilter.pl script.

:0fw
* !^From:.*safe-addresses
| /usr/local/scripts/clamfilter.pl

That sends only mail that isn't from safe-addresses through the clamfilter script. In my setup, I get a lot of email from mailing lists that already run SpamAssassin, so I only submit mail to SpamAssassin if it doesn't already have the X-Spam-Status header.
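A recipe for skipping mail that already carries an X-Spam-Status header might look like the fragment below. It's a sketch that reuses the lock file and the /usr/bin/spamassassin path from the article; adjust both for your own setup. Keep in mind that anyone can add that header to a message, so this trades some safety for speed.

```
# Only filter mail that has not already been tagged upstream.
:0fw: spamassassin.lock
* !^X-Spam-Status:
| /usr/bin/spamassassin
```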
--
We are a monoculture of horsecock. Liar
[ Parent ]

whitelist_to (none / 3) (#27)
by BenJackson on Sun Feb 01, 2004 at 03:56:43 PM EST

If you use SpamAssassin you can 'whitelist_to' your personal address. It's only worth (by default) -6, but that will let some pretty spammy stuff through. Unless you're concerned about resource consumption or delivery delays (SA can take many seconds if it has to scan a large message) I would just route everything through SA. It has much more power and flexibility than procmail.

The other reply specifically shows how to exclude virus tests based on from address. Since most viruses are forged using address books, who they are from is not a useful test.

For using procmail to filter based on who the message is to you want to use a special procmail pattern that is designed to match as many of the relevant headers as possible:

:0:
* !^TO_safeaddress@me.com
| script-for-NOT-to-safe-me
The ^TO_ is for matching addresses. There's also ^TO which can be used to match any word on the line. Beware examples that use ^TO_.* -- that defeats the end of the pattern.

[ Parent ]
SpamAssassin is total crap. (1.92 / 13) (#3)
by kitten on Sat Jan 31, 2004 at 09:51:59 AM EST

Running SA on our mailserver. It gives me dozens of false positives, and misses many very obvious spams. I have two filters in place on my local mail client - just two - and those catch more spam mails with less false positives than SA ever has.

SpamAssassin uses idiotic criteria and assigns bizarre, illogical point values to those criteria. For getting rid of spam, it is almost entirely worthless.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
Works For Me. (2.66 / 6) (#4)
by pwhysall on Sat Jan 31, 2004 at 10:37:39 AM EST

Perhaps you need to configure it correctly?

That said, on my Debian system, I just installed it and it picks up > 98% of the spam with no false positives so far. The 2% is, I suspect, a result of the latest spammer tactic of sending spam that's designed to poison Bayesian token databases.

Either way, your assertion that SpamAssassin is "total crap" is ill-informed and backed by no evidence. An awful lot of other people seem to get an awful lot of value out of it.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

Evidence? (2.55 / 9) (#6)
by kitten on Sat Jan 31, 2004 at 11:16:42 AM EST

Sure thing!

SA uses shitty criteria. For evidence, let's examine some of the things it looks at.

It assigns points if the "from:" header isn't a real name. Such as, I don't know, "support@bellsouth.net" or "billing@earthlink.com" or something like that. Legitimate emails which now have 2 or 3 points assigned to them. I get a lot of email from people who just don't configure their from: settings, so the default is to use whatever address it's sent from. SA should not penalize for that - most spam does have a real name, like "kristi" or "kelli" or "Your New Car" or something.

It assigns points for being listed in Razor2. Fine and good, I guess. But then it assigns additional points for the "confidence rating" of Razor2. It's double jeopardy. If something has a confidence rating in Razor2, it follows that it's also listed there - having two separate criteria for that is ridiculous.

It assigns points for HTML mails. I don't know about you, but in jobs I've had, I've gotten lots of HTML emails, usually from brain-dead secretaries who like to send HTML mail so they can make their fonts pink and add pictures of puppies. I discourage HTML mail, but to tag it as spam is silly.

It then assigns more points for how much HTML is included in the mail. For example, 2 points are added if the mail is "90 to 100% HTML". God knows why. If a mail is 100% HTML, it's likely that it's just some idiot secretary or your mother-in-law. I'd be more leery of a mail that was only 50% HTML, but SA lets that slide.

Here, it just tagged mail as spam because the "from:" ends in numbers. The mail in question was from a friend of mine whose "from:" does not, in fact, end in numbers. I just double checked the source. I have no idea why SA tagged that, other than the notion that it's crap.

There are plenty of other documentable examples. I wrote a comment not too long ago about it, but now I can't seem to find it. Suffice to say that SA's criteria is crap, it assigns low point values for obvious spam tactics and high point values for things that are usually benign, it overlooks a lot of obvious things and assigns double or triple penalties for many things.

I have two filters. The first checks the subj: header for keywords like "viagra", "mortgage", "loan", "vitamins", "coupon", and about 25 others which I won't bother listing - I'm sure you can figure it out.

The second checks the body for the word "click". That alone catches 70% of the spam.

Combined, these two filters snag more spam than SA ever has, with a lower rate of false positives.
[ Parent ]
Coming from a rookie here but.. (2.80 / 5) (#7)
by cosmokramer on Sat Jan 31, 2004 at 11:39:25 AM EST

From my experience with SpamAssassin it is basically a shell of possible spam checks you can run against a message and you choose exactly how many points each rewards either positive or negative and what checks take place and you can add custom blacklists and checks and well it's pretty much a custom tool.  In fact it really doesn't actually do that much does it?  Anyways I don't think I'm that far off base here but it's happened before..

[ Parent ]
Basically (2.00 / 4) (#8)
by kitten on Sat Jan 31, 2004 at 12:01:44 PM EST

It sits on the mailserver and checks incoming mail against certain criteria, and if it thinks something is spam, it adds ***SPAM*** to the subject line, and in the body it puts about three paragraphs worth of garbage about why it was tagged. The idea is that you can then set your local mail client to discard any mail with "SPAM" in the subject, if you trust SA's judgement, which I don't.

It gets annoying on false positives, because now your legitimate email has about a page of crap in front of it, "SPAM" written all over it, and sometimes SA won't show you the original mail at all, but gives it to you as an attachment.
[ Parent ]
And you know what? (2.75 / 4) (#12)
by pwhysall on Sat Jan 31, 2004 at 12:29:33 PM EST

You can change that behaviour. You can either have a simple X-Spam-Flag: header, or you can have a X-Spam-Level: header; this latter setting is where you see the score. You can choose to modify the message, the subject line, or neither.

But hey. If you don't like it, you don't have to use it.
[ Parent ]

Gee Peter (2.00 / 6) (#14)
by kitten on Sat Jan 31, 2004 at 01:10:32 PM EST

I'd love to not use it, but Bryan thinks it's great. So I don't have much of a choice, unless I want to go back to the accursed and unreliable ISP-based mail.

But that's largely an aside. All the fiddling with settings in the world won't change SA's shitty criteria and illogical point assignments. I've pointed out several examples of things SA looks at, which are stupid, and a few more examples of how it assigns points that defy any notion of sanity.

So far, nobody has said "That isn't true, and here's why." The only answer I've ever gotten is "Then don't use it," which I interpret as "Yes, it is a steaming pile of fetid dingo's kidneys and your points cannot be argued with since you are pulling them from actual examples, but since I don't want to hear it, I'll just tell you to get lost."

It's like watching Creationists beat their drums against all evidence, or players of BZFlag insisting that their game really doesn't suck wind.
[ Parent ]
It isn't true, and here's why: (2.88 / 9) (#16)
by pwhysall on Sat Jan 31, 2004 at 02:22:34 PM EST

  1. Even if it's installed on your mail system, and the system configuration is not what you need, you can override it. Read the manual about ~/.spamassassin.
  2. It does work; I can only conclude that your spam is different to mine, and that of the people and companies who use it, and consider it worthwhile.
  3. You love bitching about things; even if SpamAssassin killed 100% of all spam dead, with nary a false positive, you'd moan about the lack of iambic pentameter in the instructions.
  4. I don't think SA's tests are shitty and illogical; they go in there because they catch spam.
  5. Those tests you pointed out are all there because they are indicative of spam; taken alone, no single test will turn a ham message into a spam message. Surprise! messages with high scores, because they failed a lot of different tests that look for spamesque features in a message, tend to be spam. False positives are a problem, which is why you get your sysadmin to run sa-learn on your freshly de-spammed inbox.
  6. If you've got friends who send email that looks like spam, then you'll have to either whitelist them (RTFM again) or alter your local SA rules to accommodate them. Alternatively, you could persuade your friends to stop sending you mail that looks like spam.
Whether your local SpamAssassin installation is working or not, kitten, you will have to at some point face the reality that the software is working for other people, and working well.
[ Parent ]
uh (2.40 / 5) (#17)
by kitten on Sat Jan 31, 2004 at 04:32:18 PM EST

It does work; I can only conclude that your spam is different to mine, and that of the people and companies who use it, and consider it worthwhile.

Yeah, I among everyone else get the "weird" spam. Right.

The spam isn't the issue. The criteria SA uses is the issue. Let's examine.

0.2 HTTP_WITH_EMAIL_IN_URL URI: 'remove' URL contains an email address
A pretty fucking obvious spam tactic. Gets a measly .2 points.

0.2 EXCUSE_14 BODY: Tells you how to stop further spam
The message freely admits that it's spam, and SA gives this a puny .2 points.

0.0 CLICK_BELOW Asks you to click below
As I said, I have a filter on my MUA that checks for the word "click". I have this because 70% of the spam you'll receive will include that word. SA notes it and gives no points whatsoever for this blindingly obvious tactic.

0.5 HTML_MESSAGE BODY: HTML included in message
Half a point for an HTML mail. I hate HTML mail but there are plenty of legitimate mails that have it. This mostly benign criteria gets more points than "click" or "how to remove".

NO_REAL_NAME (0.8 points) From: does not include a real name
Again, this gets more points than any of the above, yet there are plenty of emails that are legitimate that won't include a real name. Most spam, however, does include a "real" name - "Your Blind Date" or "Win A Cruise" or something. Why SA looks at this, I have no idea.

These point values are, objectively speaking, bloody idiotic. These are but a few quick examples - there are dozens more.

I don't think SA's tests are shitty and illogical; they go in there because they catch spam.

Except they don't. See above. Add that to the fact that many of these criteria apply to legitimate emails, sometimes more often than spam mails.

taken alone, no single test will turn a ham message into a spam message. Surprise! messages with high scores, because they failed a lot of different tests that look for spamesque features in a message, tend to be spam.

"No single test", except for when SA assigns the same test over and over. A mail gets points for containing HTML, and then additional points for how much HTML, and that criteria is the complete opposite of what it should be anyway. I've had legitimate mails get tagged because of this. And that's just one example of SA's propensity for double-jeopardy.

Alternatively, you could persuade your friends to stop sending you mail that looks like spam.

You of all people should know better, Peter. I would rather rip my toenails out with rusty screwdrivers than try to explain this crap to my grandmother, or tell some secretary why she shouldn't use HTML mail to make her font pink, or explain to the boss why he needs to get a different ISP because his ISP hasn't closed certain relays, or ask my friend to please change his email address to something that doesn't end in numbers. Half of them wouldn't understand or wouldn't care and the other half would tell me to piss off.

Here's an alternative: Apply the same two filters I use on my MUA, and be done with the spam problem once and for all, without having to jiggle about with all kinds of fancy-pants rules, learning algos, settings, adjustments of point assignments, and fuck-knows-what-else.
[ Parent ]
Again maybe I'm on crack here but .. (none / 3) (#24)
by cosmokramer on Sun Feb 01, 2004 at 12:06:27 PM EST

0.2 HTTP_WITH_EMAIL_IN_URL URI: 'remove' URL contains an email address

0.2 EXCUSE_14 BODY: Tells you how to stop further spam

0.0 CLICK_BELOW Asks you to click below

0.5 HTML_MESSAGE BODY: HTML included in message

NO_REAL_NAME (0.8 points) From: does not include a real name

Are these values not completely adjustable? You can modify the SA ruleset to assign whatever values you like to anything. If you feel "0.2 HTTP_WITH_EMAIL_IN_URL URI" deserves more than 0.2, give it 10. Whatever you like. The fact is, as I think I stated before, SA is only as good as your customization. Out of the box it is useless because spam changes constantly, along with the methods spammers employ to trick people.
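For what it's worth, overriding a score is a one-line change in a user's preferences file. A small sketch, assuming that file is ~/.spamassassin/user_prefs and using a made-up helper called set_score; the rule name and the new value are only examples.

```shell
#!/bin/sh
# set_score is a hypothetical helper: it appends a "score RULE VALUE" line
# to the given SpamAssassin preferences file, creating the directory if needed.
set_score() {
    prefs="$1"; rule="$2"; value="$3"
    mkdir -p "$(dirname "$prefs")"
    printf 'score %s %s\n' "$rule" "$value" >> "$prefs"
}

# Example (not run here): give CLICK_BELOW a full point for the current user.
# set_score "$HOME/.spamassassin/user_prefs" CLICK_BELOW 1.0
```

Setting a rule's score to 0 is the usual way to switch that test off entirely.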

[ Parent ]

Maybe, maybe not. (none / 3) (#28)
by kitten on Sun Feb 01, 2004 at 05:51:47 PM EST

So basically, what everyone here is telling me is that all I have to do is spend an hour fiddling around with configs, fine-tuning them for everyone on my system, screwing around with whitelists, reassigning point values to something sane for every goddamned criterion, zeroing the point value assigned to double-or-triple-jeopardy criteria, muck about with dotfiles, and then I'll end up with useful SpamAssassin install because out of the box it's complete and total shit?

And when I'm all done, it'll work about as well as the two Outlook filters I have, which took me less than one minute to do?

Wow, I don't know how this fine product escaped notice.
[ Parent ]
user is the problem (none / 3) (#32)
by aenima on Mon Feb 02, 2004 at 04:08:45 AM EST

Looks like the mail and spam you receive are not the same as what anyone else in the world receives. So yes, you have to reconfigure, or ask the SA team for a special version just for you. If you're the only one who needs it, maybe SA is not the problem.

[ Parent ]
Don't be so quick (none / 1) (#44)
by tzanger on Mon Feb 02, 2004 at 04:55:30 PM EST

Look at STATISTICS.txt

CLICK_BELOW has an S/O (spam-only/overall) ratio of 0.93 -- it occurs almost exclusively in spam according to SA's own tests.  Why it doesn't hit with a much harder score I don't know...  Even SA's own testing shows that it would be a good test.

[ Parent ]

Where is the problem (none / 3) (#34)
by Jim Dabell on Mon Feb 02, 2004 at 08:57:24 AM EST

So basically, what everyone here is telling me is that all I have to do is spend an hour fiddling around with configs

No, you are getting people telling you that the default scoring is effective for them and not as unreasonable as you make out, and that if it causes you problems, then you can fix it without dumping spamassassin altogether. It makes more sense to pick out the rules you don't like from spamassassin than to invent your own rules from scratch, as you seem to have done.

fine-tuning them for everyone on my system

No, fine-tuning them for those that don't like spamassassin's default behaviour. My users like spamassassin's behaviour as it is, and the impression I get is that most people are the same.

screwing around with whitelists

That happens automatically.

reassigning point values to something sane for every goddamned criterion

Didn't you already list that under "fiddling around with configs?"

zeroing the point value assigned to double-or-triple-jeopardy criteria

Once more, isn't that covered by "reassigning point values?"

muck about with dotfiles

And again, how is "mucking around with dotfiles" different to "fiddling around with configs"?

Hyperbole isn't very convincing.



[ Parent ]
no need to get over-excited (none / 1) (#36)
by seb on Mon Feb 02, 2004 at 10:02:41 AM EST

Other people using your Outlook rules would not get the same success as you, because not everyone gets the same mail as you.  Spamassassin attempts to have sensible defaults for the average user.  

As you probably know, the rules are scored against large corpuses of ham and spam using a genetic algorithm.  Against these corpuses they return 1% - 3% false positives with a threshold of 5.

Using a genetic algorithm on 4 large spam corpuses seems like a pretty reasonable way of evolving the scores to me.  Evidently it doesn't work for you because for some reason you don't fit the profile of an "average" user.

I can't really think of an alternative strategy. It seems to have worked well enough for everyone who has posted apart from you, so I think that the reasonable conclusion would be that you get unusual mail, rather than that spamassassin is broken.

[ Parent ]

Curious but (none / 3) (#23)
by cosmokramer on Sun Feb 01, 2004 at 11:57:54 AM EST

Did you read my comment :)  Your reply seems out of context?

[ Parent ]
Inaccurate (3.00 / 4) (#33)
by Jim Dabell on Mon Feb 02, 2004 at 08:46:31 AM EST

It assigns points if the "from:" header isn't a real name. Such as, I don't know, "support@bellsouth.net" or "billing@earthlink.com" or something like that. Legitimate emails which now have 2 or 3 points assigned to them.

Two or three points? According to the spamassassin website, it assigns less than half a point for this rule.

I get a lot of email from people who just don't configure their from: settings, so the default is to use whatever address it's sent from.

I don't, and I suspect you are in a minority here. In any case, it's simple to disable that check if you don't like it.

It assigns points for being listed in Razor2. Fine and good, I guess. But then it assigns additional points for the "confidence rating" of Razor2.

Razor2 is a high-quality test, it makes sense to score highly in these cases. How much legitimate mail do you expect to get a high-confidence rating from Razor2?

It assigns points for HTML mails. I don't know about you, but in jobs I've had, I've gotten lots of HTML emails, usually from brain-dead secretaries who like to send HTML mail so they can make their fonts pink and add pictures of puppies. I discourage HTML mail, but to tag it as spam is silly.

You're right, it is silly. But you are attacking a straw-man argument there, as spamassassin doesn't "tag it as spam". It assigns around 0.1 points to the score. Basically, it is saying that it is slightly more likely to be spam. To put it in context, you need to go all the way up to five points before spamassassin calls something spam by default - a tenth of a point is hardly a big deal.

It then assigns more points for how much HTML is included in the mail. For example, 2 points are added if the mail is "90 to 100% HTML". God knows why.

It's quite simple. On average, HTML spam is far less likely to include a plaintext alternative than legitimate email.

If a mail is 100% HTML, it's likely that it's just some idiot secretary or your mother-in-law.

Hell no. Mail clients configured to send HTML email by default almost always send text/plain equivalents along with the HTML. Spam tools, on the other hand, don't.

Here, it just tagged mail as spam because the "from:" ends in numbers. The mail in question was from a friend of mine whose "from:" does not, in fact, end in numbers. I just double checked the source. I have no idea why SA tagged that, other than the notion that it's crap.

You may have found a bug. A single bug report is hardly likely to qualify something as "crap". And the "from address ends in numbers" only assigns a single point to the spam score - your friend had to trigger other rules to get another four points before it was tagged as spam. That rule alone would tell spamassassin that the probability of your friend's mail being spam was 20%, or, in other words, probably not spam.

There are plenty of other documentable examples.

Then please go on, because the ones you have given so far make sense to me. I have had a single (borderline) false positive in six months of using it at the default settings, and it's caught tens of thousands of spam mails in the meantime, which suggests that it's working very well, thank you.

I have two filters. The first checks the subj: header for keywords like "viagra", "mortgage", "loan", "vitamins", "coupon", and about 25 others which I won't bother listing - I'm sure you can figure it out.

Does it catch things like v1.agra? If not, then, judging by the spam I get, it's pretty ineffective, especially if any of the keywords you check for show up in legitimate mail.

The second checks the body for the word "click". That alone catches 70% of the spam.

You're kidding? You complain about all-HTML email being unfairly scored highly because of possible false-positives, and then you say you implement a rule like that? I can guarantee you that if I had a rule like that, I would have lost quite a bit of legitimate email.

Furthermore, if you are using those two rules as a binary system, where a mail either is spam or isn't, then you are far more likely to get false positives than a system using multiple rules and a probability rating. Spamassassin's behaviour allows for the possibility that legitimate mail will occasionally trigger a rule or two. Yours doesn't.



[ Parent ]
you can't poison a bayesian filter (none / 1) (#49)
by martingale on Tue Feb 03, 2004 at 03:38:16 AM EST

Not in the abstract, mathematical sense. Since most filters carrying that label aren't designed to be mathematically sound, ymmv.

A proper Bayesian trainable filter uses *all* the data it has ever seen, exactly once. That means each mail is used, and used once only. None of that retraining until no errors and other harebrained schemes. Also, a proper system must use all the contents of an email, not just five or ten words chosen in some way or other.

If those desiderata are satisfied, poison words actually help distinguish spam from legitimate mail. The reason is that the relative frequencies of those words taken together become a signature which is different from the relative frequency signature of legitimate mail.

And the beauty is that, while the signature of legitimate mail is under a single user's control, the signature of spam mail is under the control of all spammers at once. So even if one spammer tries to engineer his word frequencies in some way, the other independent spammers will, through their own spam contents, destroy those carefully crafted frequencies. Too many cooks spoil the broth, as the saying goes.

[ Parent ]

Thanks for that (none / 0) (#52)
by pwhysall on Tue Feb 03, 2004 at 07:45:07 AM EST

I wasn't aware of the finer points of the way these filters worked. I was operating under the assumption that if you feed enough crap to a Bayesian filter, the thing will cease to be able to distinguish spam from ham. I didn't make the extra mental leap to the idea that an email message that has a signature based on the words, "goldfish carburettor philosophy cuisine" (for example) would actually be staggeringly unlikely to be anything other than spam.

Well, you know what happens when you assume :-)
[ Parent ]

it's a bit more subtle than I described (none / 1) (#54)
by martingale on Tue Feb 03, 2004 at 08:35:33 AM EST

It's difficult to generalise to all filters, as they may all use slightly different algorithms. Here's a more precise version of what I'm saying, for what it's worth ;-).

Some filters take an email with 1000 (rounded up) words, and pick 5 or 10 words, computing a score. Suppose you have lots of good words and lots of bad words. The spammer put 900 random words in the email which may contain some of your good and some of your bad words. You're going to be picking 5 words, maybe 4 of which are likely to be from those random ones. So your analysis is going to be made mostly on a single spam word and four hopefully common words. At least, that's the idea for poisoning attacks(*).

Now if your filter is smart and uses all the 1000 words instead, you can do a decision based on 1000 small contributions instead of 5 small contributions. The variability in the result is much smaller, but why would it be better? If you ignore 995 words, you are effectively saying each of those words has a 50/50 chance of pushing the result either way. If you use them, each word may in fact slightly push one way or the other, but the overall score will most likely be not close to 50/50. Moreover, your spammer now needs to find 900 words which are all not too untypical of the user's word set, as opposed to letting the filter pick 4 of dubious interest. This is much harder to get right.

For example: MAKE MONEY FAST a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in a the of in.

Pick 5 distinct words according to some rule. You'll get a few prepositions, which occur fairly frequently in anyone's corpus. But if you look at all the words, you'll see that the words "a", "the", "of" and "in" occur way more often than they would in anyone's word list.
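
Here's a rough sketch of that counting argument in Python. The repetition count (19) and the "typical" frequencies are made-up numbers for illustration, not measurements from anyone's real mail, and no actual filter is claimed to work exactly this way.

```python
from collections import Counter

# The example message: three spam words followed by padding.
# The repetition count (19) is an assumption for illustration.
message = "MAKE MONEY FAST " + "a the of in " * 19
tokens = message.lower().split()
counts = Counter(tokens)
total = len(tokens)

# Hypothetical relative frequencies of the padding words in one
# user's legitimate mail (assumed values, not real data).
typical = {"a": 0.02, "the": 0.05, "of": 0.03, "in": 0.02}

# A filter that only looks at a few distinct words never sees
# how often each word occurs.
distinct = sorted(set(tokens))

# A filter that uses every word can compare the observed rate of
# each word with the rate it normally sees.
excess = {w: (counts[w] / total) / typical[w] for w in typical}

print(len(distinct), "distinct words out of", total)
for word, ratio in excess.items():
    print(f"{word!r} occurs {ratio:.1f}x more often than typical")
```

With these assumed numbers every padding word shows up several times more often than in the reference frequencies, which is exactly the information a pick-5 filter throws away.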

The signature analogy I used can also be made rather precise. For Bayesian models, it's a path in the probability landscape of the model. If you know some information theory, you can get all sorts of estimates on the probabilities of equivalent paths. But I'm just handwaving here without a concrete algorithm.

In fact, the poisoning attacks are most likely designed for a completely different purpose: foiling spam filters which are based on detecting duplicate message bodies. When a server sees a thousand identical messages to random people, it's likely spam. By changing a single word in each email randomly, you get a thousand different messages which don't trigger duplicate detectors.
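
To make the duplicate-detection point concrete, here's a minimal sketch assuming the detector simply hashes the message body. Real detectors may well use fuzzier fingerprints; the function names and the filler-word scheme are invented for this example.

```python
import hashlib
import random

def body_digest(body: str) -> str:
    """Fingerprint a body the way a naive duplicate detector might."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def mutate(body: str, rng: random.Random) -> str:
    """Replace one randomly chosen word with a random filler token."""
    words = body.split()
    i = rng.randrange(len(words))
    words[i] = "x" + str(rng.randrange(10**6))
    return " ".join(words)

rng = random.Random(0)
original = "make money fast with this one simple trick"

# A thousand identical copies collapse to a single fingerprint.
copies = [original] * 1000
identical_digests = {body_digest(b) for b in copies}

# A thousand one-word variants produce (almost) a thousand fingerprints.
variants = [mutate(original, rng) for _ in range(1000)]
variant_digests = {body_digest(b) for b in variants}

print(len(identical_digests), len(variant_digests))
```

An exact-hash detector sees the first batch as one message sent a thousand times and the second batch as a thousand unrelated messages.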

[ Parent ]

Spamassassin wacky scores (3.00 / 4) (#18)
by BenJackson on Sat Jan 31, 2004 at 06:24:58 PM EST

You're right about the SA scores being wacky, although they are getting better.  Originally the scores were hand assigned by whoever wrote the rule.  Obviously that person knew whether the rule was supposed to match spam or nonspam, and roughly how "bad" it was.  The number of rules reached a point where it was no longer feasible for a human to tweak all of the scores by hand.  A perl program was written that assigned scores using a genetic algorithm that tried to minimize the number of incorrect classifications on a large corpus of spam and nonspam.

The problem was that many rules were underconstrained -- it didn't really matter what score they had because in the corpus they were dominated by other rules.  A few rules became 'bail out' rules for nonspam (like IN_REPLY_TO) -- they got large negative scores that made the values of the other rules the nonspam matched less critical.  The result is a set of scores that is very likely ideal for the corpus, but not that great for real life (and in many cases very easy for spammers to circumvent -- IN_REPLY_TO being a perfect example).  In more recent versions of SA the rules are more realistic because the genetic algorithm is restricted by hints.  Good enough that I deleted the majority of my score band-aids in my user_prefs.
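
A toy illustration of the 'bail out' problem, in Python. Every score here is invented, the threshold is an assumption, and the two spam rule names are hypothetical; only IN_REPLY_TO comes from the discussion above.

```python
# All scores are made up for illustration; they are not SA's real values.
SCORES = {
    "SUBJECT_ALL_CAPS": 2.0,      # hypothetical spam rule
    "MENTIONS_FREE_MONEY": 3.5,   # hypothetical spam rule
    "IN_REPLY_TO": -6.0,          # 'bail out' rule with an assumed score
}
THRESHOLD = 5.0  # assumed spam threshold

def total_score(hits):
    """Add up the point values of every rule the message triggered."""
    return sum(SCORES[name] for name in hits)

def is_spam(hits):
    return total_score(hits) >= THRESHOLD

plain_spam = ["SUBJECT_ALL_CAPS", "MENTIONS_FREE_MONEY"]
# Same spam, but the spammer adds a forged In-Reply-To header.
forged_reply = plain_spam + ["IN_REPLY_TO"]

print(total_score(plain_spam), is_spam(plain_spam))
print(total_score(forged_reply), is_spam(forged_reply))
```

One cheap forged header is enough to drag an otherwise obvious spam under the threshold, which is why a large negative score on an easily faked rule is dangerous.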

I'm surprised SA still uses point scores.  It seems obvious to me that once the rules are evaluated the spamminess should be evaluated using Bayesian techniques, just like they currently use for word frequencies.

(also note that spamassassin -d can be used to remove all sa added crap if you decide something is nonspam.)

[ Parent ]

Old version of SpamAssassin? (none / 1) (#40)
by labrown on Mon Feb 02, 2004 at 01:35:55 PM EST

I read the long conversation about your problems with SpamAssassin, and it sounds like you are living under an old version of SA. I've been using SA for a while now and have seen it mature rapidly over the last few versions. 2.6 and up have been very good and require very little tuning to get good results.

Check the X-Spam-Status header for the version of SpamAssassin. If it's not 2.63 ask whoever controls it to upgrade.

[ Parent ]

Configure it and it will work (none / 1) (#43)
by tzanger on Mon Feb 02, 2004 at 04:46:02 PM EST

I run SpamAssassin on a handful of domains.  One is the company I work for, the other is a 15k-user dialup ISP.  It's configured along with procmail to toss spam into users' "Spam" IMAP folders, and I have a squirrelmail plugin I had made to let users alter their threshold and black/whitelist settings.

It tags probably 90% of the spam I get with zero false positives.  If I spent some more time properly training the Bayesian filters and other goodies I would have full confidence in it hitting damn near 100%.

Perhaps you just need to configure SA properly.  From your subsequent posts on the subject it looks like you threw it on your system, went "Hmph, look at these crap headers, what's it doing to my subject line, jesus this thing applies anal pneumatic pressure" and tossed it.  There are tons of domains using this very effectively.

[ Parent ]

What spamassassin version are you running? (none / 0) (#56)
by batkiwi on Thu Feb 05, 2004 at 02:23:32 AM EST

It's not the newest if it's using body reporting by default, that was phased out LONG ago.  Upgrade to 2.6x.

The # scoring of spamassassin is great.  I use it with procmail and give anything >=9 to /dev/null, anything >=5 and <9 to my "likely spam" folder.   I then feed anything in likely spam to sa-learn --spam if it is spam, and sa-learn --ham if it isn't.
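
For anyone who'd rather see the routing logic spelled out, here's a sketch in Python rather than procmail. The thresholds are the ones above; the destination names "likely-spam" and "inbox" are placeholders, and this is not a transcription of an actual procmailrc.

```python
def route(score: float) -> str:
    """Pick a destination for a message from its SpamAssassin score."""
    if score >= 9:
        return "/dev/null"      # confident enough to discard outright
    if score >= 5:
        return "likely-spam"    # reviewed by hand, then fed to sa-learn
    return "inbox"

for score in (12.3, 9.0, 6.5, 5.0, 4.9, -1.0):
    print(score, "->", route(score))
```

The middle band is the useful part: it is the only folder that needs a human look, and whatever ends up there becomes training data either way.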

SpamAssassin blocks ~25 spams for me per day, and I've not had a SINGLE false positive in over a year.

I run my own mailserver, and one of my users gets over 200 spams a day, and she has the same luck I have.

Also, any of your friends should end up on auto-whitelists after about 4 emails to you that aren't spammy.

[ Parent ]

Spamassassin Configuration Tips (3.00 / 7) (#19)
by BenJackson on Sat Jan 31, 2004 at 06:51:15 PM EST

Solicited commercial email is one of the trickiest categories of mail to identify. Some companies have learned not to include breathless promises of free merchandise with their sales receipts, but many still run afoul of rules aimed at spam. The best way to avoid these false positives is to create personal rules that key in on data that most spammers don't have, or at least don't bother to mail-merge into their spam. Here are some things you can add to your ~/.spamassassin/user_prefs:
  • One of the most successful rules I have for keeping receipts and order tracking information out of my junk folder is just my address.
    body MY_ADDRESS /1600 pennsylvania ave/i
    score MY_ADDRESS -15
  • If most spam that identifies you by name uses a nickname, try adding a rule that matches your full, given name (that you use as your billing address, for example):
    body MY_FULL_NAME /Benedict/i
    score MY_FULL_NAME -2
  • Some forms identify the IP address that submitted the form. If your IP address is somewhat stable (even most cable and DSL dynamic IPs don't change that often), add a rule that matches your IP (separated by either dots or dashes):
    body MY_IP /12[.-]224[.-]233[.-]146/i
    score MY_IP -5
  • ...and anything else you fill out in legitimate order forms that might be emailed back to you, like phone numbers.
If you are handy with regexps you can make other rules that match keywords related to your work or hobbies. I have a rule that matches aircraft N-numbers, for example. That'd be useless in the 'core' spamassassin ruleset, since it would be easy for spammers to defeat. As a personal rule it's very effective. Consider rules that match the formats of trouble ticket IDs, product IDs, serial numbers, etc. Pure Bayesian filters can't learn to generalize those formats from many examples, so a regexp rule helps a lot.
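
If you want to try a pattern before putting it in user_prefs, a quick check in Python is one option. SpamAssassin rules are Perl regexps, but patterns this simple behave the same way in Python's re module. The IP pattern is the one from the list above; the N-number pattern is only a rough guess at the format written for this example, not the rule I actually use, and the sample strings are invented.

```python
import re

# Same pattern as the MY_IP rule above: octets separated by dots or dashes.
MY_IP = re.compile(r"12[.-]224[.-]233[.-]146", re.IGNORECASE)

# Hypothetical, simplified N-number pattern: 'N', one to five digits,
# then up to two letters (I and O excluded). Looser than the real format.
N_NUMBER = re.compile(r"\bN[1-9][0-9]{0,4}[A-HJ-NP-Z]{0,2}\b")

samples = [
    "Order submitted from 12.224.233.146",
    "Reverse DNS: 12-224-233-146.example.net",
    "Order submitted from 12.224.233.147",
    "Annual inspection for N12345 is due",
    "No thanks, not interested",
]

for text in samples:
    print(bool(MY_IP.search(text)), bool(N_NUMBER.search(text)), text)
```

Testing against a few lines copied from real receipts and a few from real spam is a cheap way to catch a pattern that is too tight or too loose before it costs you mail.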

Other personal rules you should consider are rules that score heavily against the 'wrong' versions of your name (I'm Benedict, Ben for short, but not Benji, Benny, Ben ben, or Benjamin, all of which frequently occur in mail-merged spam). I also score against mail which is delivered to accounts which forward to me but are not my primary address.

You don't have to send everything to spamassassin. (3.00 / 3) (#25)
by waxmop on Sun Feb 01, 2004 at 01:12:54 PM EST

The beauty of procmail + spamassassin is that you can choose what gets screened. In my case, I only submit email to spamassassin if I don't recognize the From address.

:0 fw
* !^From:.*(friend1|friend2|...|friendN)
| spamassassin -a

--
We are a monoculture of horsecock. Liar
[ Parent ]

Results ? (none / 1) (#26)
by bugmaster on Sun Feb 01, 2004 at 01:27:37 PM EST

How effective is this approach ? Has anyone tried it ? What is the ratio of false positives ? At home, I use PopFile, because, being a mail proxy, it's trivial to install. PopFile is pretty good, but an occasional spam still slips through about once a week now, hence my interest.
>|<*:=
popfile only works for pop mail users. (none / 2) (#30)
by waxmop on Sun Feb 01, 2004 at 05:58:49 PM EST

And everybody knows that if you're only using pop mail, you might as well stick with AOL. Real h4x3rs run their own mail servers.

No, just kidding, but really, popfile is a great tool, but it isn't the only tool.
--
We are a monoculture of horsecock. Liar
[ Parent ]

System resources. Use bmf instead. (none / 2) (#31)
by rrm3 on Mon Feb 02, 2004 at 03:06:18 AM EST

I stopped using SpamAssassin because using all the tests required to make the filtering effective made my 166MHz/32MB grind almost to a halt. I've found bmf (Bayesian Mail Filter) to be not only much faster but also more effective once it's trained properly.
-- 
Sunken Basement
easy server-side filtering (none / 1) (#35)
by seb on Mon Feb 02, 2004 at 09:44:58 AM EST

For filtering stuff at the server, it's worth looking at MailScanner.  It integrates with clamav and spamassassin and also does things like textifying html attachments, refusing to deliver web bugs, sending virus notifications, adding disclaimer text and all the other often irritating things which business email gateways often do.

But it's pretty good if that's what you want.

Worried about false positives? (none / 1) (#37)
by jynx on Mon Feb 02, 2004 at 10:17:14 AM EST

Even the best spam filter occasionally tags legitimate mail as spam.

I have a similar system to the author of this story, but I've also thrown ASK into the mix.

When SpamAssassin tags an e-mail as spam, ASK holds it in a queue and sends a confirmation request to the e-mail's sender.  If the e-mail is legitimate, the sender can reply to this confirmation request to have the e-mail delivered.

This filter has stopped about 500 spams a week for the last 3 months, and I have still yet to have a spammer reply to the confirmation request.  On the other hand, I've had a handful of people who have sent legitimate e-mails, which would have been filtered by SpamAssassin, but have replied to the confirmation e-mails to have their message delivered to me.

I also wrote a couple of small scripts. One expires messages queued for confirmation after two weeks, and feeds the expired messages into sa-learn, to improve SpamAssassin's classification.

The other automatically scans my sent mail and read mail folders and adds all the e-mail addresses to my ASK whitelist.  I make sure that any spam which gets past the filter doesn't end up in my Read mail folder, so this ensures that anyone who has sent me mail, or I have sent mail to, is exempted from being queued.  My read mail folder is also fed into sa-learn by a cron script.  This ensures that any e-mails which were queued, but later confirmed by the sender, are used to improve SpamAssassin's classification.
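
The whitelist-harvesting idea can be sketched in a few lines of Python. This is not the script described above and it doesn't write ASK's whitelist format; it only shows the address extraction, using raw message strings with invented addresses in place of real mail folders (a real version would probably read them with the standard mailbox module).

```python
from email import message_from_string
from email.utils import getaddresses

def harvest_addresses(raw_messages):
    """Collect lower-cased addresses from From, To and Cc headers."""
    whitelist = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        headers = []
        for name in ("From", "To", "Cc"):
            headers.extend(msg.get_all(name, []))
        for _display_name, addr in getaddresses(headers):
            if addr:
                whitelist.add(addr.lower())
    return whitelist

# Stand-ins for one sent message and one read message.
sent = "From: me@example.org\nTo: Alice <alice@example.com>\nSubject: hi\n\nbody\n"
read = "From: Bob <bob@example.net>\nTo: me@example.org\nSubject: re\n\nbody\n"

whitelist = harvest_addresses([sent, read])
print(sorted(whitelist))
```

The important precondition is the one already stated: nothing spammy may sit in the folders being scanned, or forged senders end up whitelisted.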

Overall, this gives me a system which is virtually maintenance free.  I don't need to check my spam folder for false positives.  The only e-mails I will miss are those wrongly classified as spam by SpamAssassin (a tiny percentage) and not confirmed by the user (a tiny percentage of a tiny percentage).

The only intervention I do is copying spam which SpamAssassin wrongly classifies as not-spam into my spam folder.  This is easy, as I've configured a key in mutt to do it, and it only happens a couple of times a week now.

OTOH, I don't use a virus scanner.  I find that SpamAssassin learns about viruses pretty quickly, and tags them as Spam, so I don't see them.

--

Challenge-response considered harmful (none / 1) (#38)
by kmself on Mon Feb 02, 2004 at 12:34:33 PM EST

Please don't do that.

Most spam forges 'From' address. Because undeliverable domains would be easy to detect and filter, legitimate domains are used. Meaning that mail sent to the 'From:' header of spam does go to someone.

Viruses and other forms of mail abuse only make this problem worse.

If I get a challenge from someone based on spam or viral mail, I approve it. Think about that.

I've been having an extended conversation with Brad Templeton over the past few days about the problems of challenge response. While he doesn't agree with all my complaints, he does concede that C-R was never meant to be a silver bullet, and that measures should be taken to minimize false positives, above and beyond his existing best practices recommendations. Actually, it looks like he has updated these:

  • Avoid challenging virus mail and other forged mail
  • Make use of other authentication tools
  • Combine with other spam algorithms
  • Spammers may try to fake the things you detect

If SA tags mail as spam with a score of 7 or more, odds are something like 10,000:1 that it is spam. I've got a distribution of SA scores for spam and ham, and I've got one false positive above 6.

Challenge-response (what ASK, TMDA, Mailblocks, Earthlink's "high" spam filter setting, and Microsoft's latest spam "solution" are) isn't a spam mitigation solution, it's a spam distribution solution. You're giving your spam to someone else, and asking them to evaluate it for you.

My response to this: no thanks.

The results are predictable. SpamCop routinely lists MailBlocks as a spam source because of challenges sent to SpamCop honeypots -- see the discussion / newsgroup archives. The tmda-user list is full of people talking about the thousands, tens of thousands, or hundreds of thousands of challenge messages they've sent. There's some risk that the 92% of Earthlink subscribers not using C-R will find their mail filtered on account of blacklisting of Earthlink mailservers based on bogus C-R challenges.

If you need to whitelist your mail, do it yourself. Check your spam filters, don't discard the mail immediately, or, if you can, reject spam at SMTP time.

For more information: Challenge-Response Considered Harmful

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

C-R (none / 1) (#41)
by jynx on Mon Feb 02, 2004 at 01:40:47 PM EST

Although in part, I see your point, I disagree with your conclusions.

Meaning that mail sent to the 'From:' header of spam does go to someone.

My own research shows this not to be true.  The overwhelming majority of challenges go to non-existent addresses.  Which makes sense, because in most of the spam I get, although the domains are valid, the part before the @ is generally random junk.

So, it is true that this is putting additional burden on the domain spoofed, but this is pretty small compared to the overall burden of spam, and is no different from if my e-mail address didn't exist - the spoofed e-mail address would simply get undeliverable bounces instead of challenges.

If I get a challenge from someone based on spam or viral mail, I approve it. Think about that.

The virus goes into my inbox, and you get whitelisted.  Doesn't seem like such a big deal to me.

For more information: Challenge-Response Considered Harmful

This article doesn't convince me.

Point 0, while perhaps true IN THEORY, simply doesn't apply in the real world.  As I pointed out, a spammer has NEVER replied to any challenge sent out by my system.

Assuming that everyone had C-R (which is much of the basis of the article), spammers who wanted to bypass C-R filters would incur huge costs compared - they would actually have to send using legitimate e-mail methods, and would therefore have to actually pay for the bandwidth to send their spam (several times over).

Point 1, is pretty hazy.  I don't see how it really applies.  As it clearly states "At a practical level, the goal is to minimize the amount of spam received, while ensuring no (or the very minimum) of legitimate mail is lost."  My own experience has found that methods which simply ignore spam mean that I lose mail.  If I have to manually check all the e-mails in my spam folder, what exactly is the point of having a spam filter?  I still waste as much time checking my spam folder as I would deleting the spam from my Inbox.

Point 2 is wrong.  C-R DOES place the burden on the spammer - the burden being if they want to get spam through, they have to use a legitimate e-mail address so they can receive the challenges, and pay for the bandwidth used.

Point 3 is nonsense.  There is no reason why C-R systems need to be on the mail server.  Even if C-R DID have to be done on a mailserver, this would still be nonsense.  The mailserver has to queue the e-mail while waiting to deliver it to me even without a C-R filter.

Point 4 doesn't apply to my setup.  My whitelist is mostly done by me (albeit automatically) by scanning my sent/read mail folder.  Only a handful of people have actually ever needed to reply to a challenge to get an e-mail through.

Point 5 is irrelevant, as my system chains spamassassin with C-R.  Challenges are only sent for messages which are thought to be spam.

Point 6, while theoretically true, seems practically highly unlikely.  I could DOS someone by sending out a large amount of spam with their address causing them to be deluged with challenges.  But this implies I have the capacity to send out a large amount of spam - I could just DOS them by sending the mail directly to them, or sending out spam to known incorrect addresses with their addresses so they get all the bounces.

Point 7 is wrong.  This condition is trivially detected and prevented by not sending multiple challenges to the same address before the first is  acknowledged, which is implemented.
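
A minimal sketch of that "one outstanding challenge per address" rule, in Python. This is an illustration of the idea only and makes no claim about how ASK actually implements it; the class and method names are invented.

```python
class ChallengeTracker:
    """Send at most one unacknowledged challenge per sender address."""

    def __init__(self):
        self.pending = set()     # challenged, not yet acknowledged
        self.whitelist = set()   # acknowledged senders

    def should_challenge(self, sender: str) -> bool:
        sender = sender.lower()
        if sender in self.whitelist or sender in self.pending:
            return False
        self.pending.add(sender)
        return True

    def acknowledge(self, sender: str) -> None:
        sender = sender.lower()
        self.pending.discard(sender)
        self.whitelist.add(sender)

tracker = ChallengeTracker()
print(tracker.should_challenge("someone@example.com"))  # first message
print(tracker.should_challenge("Someone@example.com"))  # second, still pending
tracker.acknowledge("someone@example.com")
print(tracker.should_challenge("someone@example.com"))  # now whitelisted
```

Two C-R systems following this rule can exchange at most one challenge in each direction, so they cannot loop.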

Point 8 is wrong.  What would be the point of making a list of e-mail addresses which have a filter preventing you from spamming them?

Point 9, possibly true I will concede.  But then, I was being spuriously blacklisted by spam blacklist services long before I installed this filter, I certainly haven't noticed things getting any worse.

Point 10, wrong.  So long as the mailing list is properly configured, (and the C-R system isn't stupid, which ASK isn't) this doesn't happen.  I'm on a dozen or so mailing lists, and I can assure you that I've never sent a challenge to any of them.

Point 11, TRUE!  But not an argument against C-R.  Any recipient side filter is exactly the same in this respect.  If I could, I would address the techno-economic underpinnings of spam, but I can't, so I'm far more interested in being able to use my e-mail.

--

[ Parent ]

C-R: Worse than you thought (none / 2) (#46)
by kmself on Tue Feb 03, 2004 at 01:17:15 AM EST

Let me start with the following quote:

C/R is just a prong. I will admit to helping this confusion because for the first 5 years or so, C/R was on its own a fully sufficient (and in fact best) anti-spam tool. I never held the illusion you would want to rely on just that forever.

That's Brad Templeton, one of the earliest proponents of C-R and authors of a C-R system. He admits that it's not sufficient on its own, and that the bogus challenge problem (my primary objection) is a serious one. He recommends using C-R as part of a suite of tools. He recommends minimizing bogus challenges. Where we differ is that I feel once you've eliminated viruses, DNSBL sources, obvious spam, and previously whitelisted mail, you've got no need to ask someone else to challenge the remainder -- a bare handful of messages daily.

Assuming that everyone had C-R (which is much of the basis of the article), spammers who wanted to bypass C-R filters would incur huge costs compared - they would actually have to send using legitimate e-mail methods, and would therefore have to actually pay for the bandwidth to send their spam (several times over).

C-R adoption is likely to remain relatively low for a time. However moves by Earthlink, Microsoft, and/or others could change this very rapidly. With as few as 5% of the Net using C-R, you could expect to see spoofed challenges daily or more often, while a legitimate challenge only appears once a month. C-R relies on a nondeterministic function: the response of the person challenged to the challenge. If they either ignore legitimate challenges, or spite you with bogus ones, the system fails.


Point 1, is pretty hazy. I don't see how it really applies. As it clearly states "At a practical level, the goal is to minimize the amount of spam received, while ensuring no (or the very minimum) of legitimate mail is lost." My own experience has found that methods which simply ignore spam mean that I lose mail. If I have to manually check all the e-mails in my spam folder, what exactly is the point of having a spam filter? I still waste as much time checking my spam folder as I would deleting the spam from my Inbox.

First: spam detection != spam tagging. There are many ways of dealing with spam.

It's far easier to quickly scan through a spam folder looking for falsely tagged mail (all of it was spammy in the first place, right) than to look through an uncategorized mailbox and sort out the non-spam stuff.

You can also apply filters or rules within the spam folder. I highlight spam by its score, from green (low spam score) to blue to yellow to red, as the score increases. You could alternatively sort mail into low, middlin', and high spam mailboxes.

As I said: tagging and sorting mail is only one approach. Far better is to simply reject spam at SMTP time by various characteristics (DNSBL, content/context filters, etc.). Legitimate senders immediately know that their mail was rejected (and preferably why). You don't send misdirected bounces or challenges to third parties. And you don't have to sort through the chaff. With suitably tuned filters, your false positive rate is low.

Because you are rejecting the mail, it's not "lost" by the system, but clearly indicated as having tripped an error condition.

Point 2 is wrong. C-R DOES place the burden on the spammer

No, you are wrong.

C-R places a burden on the presumed spammer. You don't know that a mail is spam, you're accusing the listed sender of spamming. Based on highly spoofable, and unvalidated information.

The likelihood that this reaches a spammer is low.

Spammers can (and are) responding to the problem already. Consider the "$40 Nigerian solution":

Spammer sends out 1 million emails. 1% of recipients use C-R. Spammer gets 10,000 challenges. These go to an ISP in Nigeria which is very happy to be paid the big bucks by the spammer to provide various services. There are five "email validation response technicians" paid the princely wage of $1/hr (160% the national average wage) to respond to four challenges a minute, 60 minutes an hour, eight hours a day. The net increased cost for 1 million spams is $40.
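
For anyone who wants to check the arithmetic, here it is in Python. The inputs are the figures from the scenario above; the variable names are just labels for this example.

```python
spams_sent = 1_000_000
cr_percent = 1                 # share of recipients using C-R
challenges = spams_sent * cr_percent // 100

responses_per_minute = 4       # per technician
wage_per_hour = 1.00           # dollars
technicians = 5
shift_hours = 8

# Hours of labor needed to answer every challenge.
labor_hours = challenges / (responses_per_minute * 60)
labor_cost = labor_hours * wage_per_hour

# Hours available from five technicians working one shift.
shift_capacity_hours = technicians * shift_hours

print(challenges, round(labor_hours, 1), round(labor_cost, 2), shift_capacity_hours)
```

The work comes to a little under 42 person-hours, so five people clear it in roughly one eight-hour shift, and the $40 is that single day's payroll.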

And if more than 1% of email users have C-R, we're all inundated with bogus challenges and blacklisting one another's SMTP servers.

In a world with depressingly cheap labor, highly corrupt countries (Nigeria is among the worst in the world), and companies desperate for cash, C-R loses.

I was first told of this scenario by none other than Earthlink's own abuse manager, Mary Youngblood, personal phone conversation, fall of 2003. She'd gotten it from email marketers themselves.


Point 3 is nonsense. There is no reason why C-R systems need to be on the mail server.

You're intentionally missing the point. For any user of Earthlink, Mailblocks, or a Microsoft C-R system, the whitelist will be on the mailserver. Sure, tech-savvy folks can and do implement their own locally managed C-R systems, but they're going to be overwhelmingly the minority. Sure, there's no need. But the practical necessity is that this will be effectively always the case.

Even if C-R DID have to be done on a mailserver, this would still be nonsense. The mailserver has to queue the e-mail while waiting to deliver it to me even without a C-R filter.

A mail queue is held for a few seconds or minutes. Undeliverable messages may reside for as much as four days under typical configurations.

ISPs aren't in the business of, and have few compelling business reasons to, retain logs for more than a short period of time. A few days or weeks, typically. In general, no more than a billing cycle. By contrast, C-R requires a comprehensive, permanent, subpoenable, crackable list of all your correspondents be kept online. And for 99.99% of C-R users, that will be on their ISP or mail service provider's server.

Brad Templeton's failure to grasp this issue, while he heads the very privacy-conscious EFF, is one of life's delicious ironies.


Point 4 doesn't apply to my setup.

See above. You are, effectively, nobody. In the general case, C-R whitelists are generated by challenges. See above.

You are also directly affected by the general perception of C-R by challenge recipients. If people stop responding as you expect them to, your system breaks. There is no way for you to engineer around this from within the context of C-R. This is the primary reason I state that C-R is broken by design.


Point 5 is irrelevant, as my system chains spamassassin with C-R. Challenges are only sent for messages which are thought to be spam.

Oh, good, so you're only challenging mail that is HIGHLY likely to spoof the sender address, and comprises (if you're typical) 60%+ of your email volume.

I'm so reassured.

And: you're nobody. See above.

And: you're directly contributing to the perceived annoyance factor associated with C-R challenges yourself. See above.


Point 6, while theoretically true, seems practically highly unlikely.

You're not only intentionally missing the point, you're fully ignorant of the facts. SoBig spoofed addresses within a small set of domains, including microsoft.com, msn.com, and ms.com. That last isn't a Microsoft domain, but belongs to Morgan Stanley Dean Witter. Swen didn't spoof domains from within a small block, but picked arbitrary addresses. MyDoom is doing similarly. Spam likewise spoofs my address with alarming regularity (the first such occurrence was the reason I started GPG signing all my email).

If you look at the tmda-users list at the time of the SoBig outbreak, you'll find users bragging, yes, bragging about sending out thousands, or hundreds of thousands of challenges based on SoBig mail.

All of which went to Microsoft and Morgan Stanley.

How is this not a Joe-job?

I could DOS someone by sending out a large amount of spam with their address causing them to be deluged with challenges. But this implies I have the capacity to send out a large amount of spam - I could just DOS them by sending the mail directly to them, or sending out spam to known incorrect addresses with their addresses so they get all the bounces.

So...you're saying a Joe-job isn't a Joe-job if there are more effective ways to accomplish the same task directly. Note that a spammer can effectively multiply their outbound capacity by specifying multiple recipients on a single outbound mail. When sent to a system implementing C-R, each recipient generates a separate challenge mail. C-R is a spam multiplier.


Point 7 is wrong.

It's documented. It's not common in well designed C-R systems. It is possible.

This condition is trivially detected and prevented by not sending multiple challenges to the same address before the first is acknowledged, which is implemented.

This is a case of what I call the "But a well-designed system won't do that" objection. The problem for you, as a C-R user, is that when I get a challenge, I've got no idea if I'm dealing with a well-designed system or not. Nor do I care. Nor, as I point out, do you have any business challenging my mail in the first place, as you've got plenty of basis for determining the legitimacy of my mail and identity.


Point 8 is wrong. What would be the point of making a list of e-mail addresses which have a filter preventing you from spamming them?

I'm shocked, shocked, but you've once again missed the point.

See above: I've no idea if a C-R system is well designed. It's moderately difficult (and at times impossible) to determine if a challenge was legitimately sent to me or not. There's no reason that a spammer wouldn't utilize the social engineering trick of disguising email harvesting mail as C-R challenges, in the same way that current phishing tactics spoof eBay and bank websites, or that viruses emulate MTA bounce messages.


Point 9, possible true I will concede. But then, I was being spuriously blacklisted by spam blacklist services long before I installed this filter, I certainly haven't noticed things getting any worse.

Think about this: if your own spam prevention system is getting you blacklisted, isn't there something seriously wrong with your approach? As I've said, both MailBlocks and Earthlink are showing up on blocklists. The tmda-users list has evidence of people's challenges being forwarded to SpamCop. Justin Mastaler responds by calling SpamCop "overzealous". Yeah. Right.


Point 10, wrong.

You'll be happy to know, I'm sure, that it was an ASK challenge received in response to a mailing list post which triggered my writing that rant in the first place.

So long as the mailing list is properly configured...

See the "But a well-configured system..." objection above.


Point 11, TRUE! But not an argument against C-R.

If C-R doesn't work, has multiple faults, and doesn't shift the balance in the spam war, why is this not an argument against C-R?

Any recipient side filter is exactly the same in this respect.

Wrong. The increase in the amount of obfuscated, misspelled, dyslexic, popcorn, and similar spam mail is a direct response to the increasing effectiveness of Bayesian filters. Most of these tricks don't work. A small fraction have slipped through my filters since I started seeing them on Dec 17, but are trapped by rules which look for structural characteristics of spam that the obfuscation relies on. And any human reader looking at the subject or message simply sees the mail as spam.


If I could, I would address the techno-economic underpinnings of spam, but I can't, so I'm far more interested in being able to use my e-mail.

Teergrubing, DNSBL, SPEWS, QoS rate throttling, SMTP reputation systems, Bayesian and other filters, all do address the underpinnings, directly, with very little negative fallout.


--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

C-R objections don't necessarily apply (none / 1) (#42)
by koreth on Mon Feb 02, 2004 at 03:32:50 PM EST

Most objections to C-R seem to be based on the underlying assumption that a large percentage of legitimate E-mail messages trigger the C-R system. The arguments lose most of their relevance, in my opinion, when that's not the case.

I've been using a C-R system on my mailbox for two years, but like the grandparent poster, it only gets triggered for borderline cases -- if a piece of mail doesn't look like spam, it goes through unhindered and its sender is placed on my whitelist, and if it looks too much like spam (equivalent of a high SpamAssassin score) then it gets discarded.

My experience, based on occasional analysis of my mail logfiles, is that between 1 and 2 percent of my new correspondents get hit with a challenge. And since the vast majority of my mail is from people who have already sent me at least one message in the past, it works out to well under .1% of my incoming non-spam messages. More than that of the spam, but almost all of my spam is from nonexistent sender addresses anyway, so the challenges bounce immediately.

In the two years I've been using this system, I've had, I think, three spammers reply to my challenges. And I've had no "What? I didn't send you anything!" responses.

I know I wouldn't mind answering a challenge for one out of every 75 or so messages I send out, because for every person who has a (well-configured and well-behaved and part of a larger spam filtering solution) C-R system, there's that much less economic incentive for people to send out spam.

[ Parent ]

That's not my objection (none / 1) (#51)
by kmself on Tue Feb 03, 2004 at 06:07:34 AM EST

Most objections to C-R seem to be based on the underlying assumption that a large percentage of legitimate E-mail messages trigger the C-R system. The arguments lose most of their relevance, in my opinion, when that's not the case.

Not mine.

In a world where 18 billion spam messages are sent daily (and if that's not today's value -- 600m users, 50 messages each, 60% spam -- it soon will be tomorrow's), and even a small fraction of spam spoofs an individual's address, I can be pretty certain that some stranger somewhere is receiving mail from another stranger somewhere, with my name on it. And in a world where even a small fraction of those strangers have C/R, I'm going to be seeing a challenge a day. Or four. Or eight. Or sixteen. Or...

All bogus.

And then your legitimate challenge comes in. As all the C/R proponents say, I should only see one legit challenge a month. Well, the other 480 bogus challenges will have pretty well trained me, and my mail filters, to treat it as the trash it is.
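For anyone checking the arithmetic, the two figures above work out as follows. The inputs are the ones given in the comment; the 30-day month is an assumption.

```python
# Inputs taken from the comment above.
users = 600_000_000
messages_per_user_per_day = 50
spam_percent = 60

# Integer arithmetic avoids floating-point rounding noise.
spam_per_day = users * messages_per_user_per_day * spam_percent // 100
print(f"{spam_per_day:,}")  # 18,000,000,000

# Sixteen bogus challenges a day over an assumed 30-day month.
bogus_per_month = 16 * 30
print(bogus_per_month)  # 480
```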

So, if you've got mail in your mailbox, and it says it came from me. Well. That's your problem to deal with. Not mine.

The fact that you've configured your C/R system in what appears to be a reasonably sane fashion is all very well and good. I've got no flippin clue you've done this though. And you're going to suffer for every single bogus challenge that someone else's system sends -- training the recipient that C/R is spam. See my earlier response on this thread for additional problems.

Have a good day.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Umm... (none / 0) (#59)
by Kenoubi on Thu Feb 05, 2004 at 11:14:16 AM EST

I can't understand how bogus challenges should be a serious problem, if both sides are aware of the existence of challenges. Your mail client should be able to keep track of what messages you've sent, and match up whether a challenge message is from someone to whom you've recently sent email. If not, the challenge could just be dumped. Yes, it would still waste some system resources, but it would also force spammers to use legitimate from addresses to get their spam through (actual legitimate addresses that they own, not just legitimate in the sense that there's someone reading them).

Similarly, SpamAssassin was tagging messages to me that looked like bounces as non-spam. Obviously, spammers adjusted to this and started sending tons of spam that looks like bounce messages, so I adjusted the score for “looks like a bounce” as far in the other direction as I could without classifying legitimate bounces as spam. What I wish it would do is remember if I've recently tried to send email to someone, and use that to determine whether a bounce message I'm receiving could possibly be legitimate.



[ Parent ]

Still wrong (none / 0) (#60)
by kmself on Thu Feb 05, 2004 at 01:05:27 PM EST

I can't understand how bogus challenges should be a serious problem, if both sides are aware of the existence of challenges.

Check your premises.

Your mail client should be able to keep track of what messages you've sent, and match up whether a challenge message is from someone to whom you've recently sent email.

OK, here's a clue.

Mine doesn't.

Here's the other clue: most don't. To the point of practicality: no email client currently provides this capability. You're arguing based on a presumption which is prima facie false. And in practicality, won't be implemented, even if all MUA developers crack on it now, for years. Most if not all of a decade. Call me in 2014 and tell me where we're at.

Sure, it can (and in my case does) file all sent messages to a folder. Which I can search through. Manually.

There's nothing inherent in email responses that guarantees I'll be able to match up a given challenge with a given sent mail. This requires that the challenge contain some key corresponding to the sent mail, the best (but still not foolproof) one of which would be a message ID.

I currently get challenges which quote all (or none) of the sent message, which are sent from the address I (or the spoofed mail) mailed (or an alternate address from which all challenges are sent, or an address expanded from an alias, or an address expanded from a mailing list).

I may send mail from multiple systems (including more than one automated system which sends mail on my behalf, handhelds, accounts on other systems, etc.) for which I don't have the original message.

Your premise is, in other words, wrong today. It's wrong tomorrow. It's not a technically soluble problem with any level of assurance, without rearchitecting email in a way that's wholly incompatible with how it works today.

Have a nice day.

Next.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Ah. (none / 0) (#64)
by Kenoubi on Mon Feb 09, 2004 at 09:36:41 AM EST

So any solution which requires new features be implemented in the client is unacceptable.  While you're at it, why not go further and say that any solution that requires changes to any piece of software anywhere is unacceptable?  In that case, gee, I guess the spam problem really is insoluble.

I was never aiming for a perfect solution, which I acknowledge that challenge/response isn't and never will be.  But almost nobody trying to solve the spam problem is aiming for that, anyway.  On the other hand, I should point out that if my email client at least attempted to keep track of what messages I've sent, and used the In-Reply-To header as an effective whitelist, and assumed that anything that looks like a reply or especially a challenge or bounce that isn't in reply to any specific message I actually sent is almost certainly spam, this would help the spam filtering problem on my end without anyone else having to deploy any additional software.

I can't see how keeping track of what messages I've sent, and using that as an additional factor (not the only factor) in deciding whether messages that look like they were generated in response to one of mine (replies, challenges, bounces) are legitimate, is anything other than making use of useful information that would otherwise be needlessly ignored.
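A rough sketch of what I mean, using Python's standard email module. The function names and the idea of a set of sent Message-IDs are my own illustration, not a feature of any existing mail client, and it only helps when the incoming message actually preserves In-Reply-To or References, which many challenges and bounces do not.

```python
from email import message_from_string


def referenced_ids(raw_message):
    """Return the Message-IDs named in In-Reply-To and References."""
    msg = message_from_string(raw_message)
    ids = set()
    for header in ("In-Reply-To", "References"):
        # Each header holds zero or more whitespace-separated <id@host> tokens.
        ids.update(msg.get(header, "").split())
    return ids


def answers_my_mail(raw_message, sent_ids):
    """True if the message cites at least one Message-ID we really sent."""
    return bool(referenced_ids(raw_message) & sent_ids)
```

The result would be one more input to the filter's score, not a verdict on its own: a "reply" that cites nothing I sent is suspicious, not proven spam.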

[ Parent ]

You're ignoring the problem (none / 0) (#65)
by kmself on Tue Feb 10, 2004 at 12:02:29 PM EST

So any solution which requires new features be implemented in the client is unacceptable.

My mouth, my words. Please don't commit the unpardonably rude practice of putting your words in my mouth.

From a pragmatic standpoint: solutions which localize changes, and minimize the burden on innocent third parties, are preferred over those which require a coordination of changes, and grossly impact third parties.

A spam countermeasure which requires me to change my client or server software, but conveys the benefits of doing so to me, is reasonable. A countermeasure which requires everyone else to change their software for my benefit is not.

While you're at it, why not go further and say that any solution that requires changes to any piece of software anywhere is unacceptable? In that case, gee, I guess the spam problem really is insoluble.

I'll just note that you've utterly sidestepped the problem at hand: mail clients don't provide any current functionality to match challenges with sent mail. Nor is there any clear way to do so for a person who uses multiple email systems. So I guess in your roundabout way, you're conceding my point.

Thank you.

I was never aiming for a perfect solution, which I acknowledge that challenge/response isn't and never will be. But almost nobody trying to solve the spam problem is aiming for that, anyway.

Perfect solutions don't exist. That leaves us with varying levels of imperfect solutions.

A response to spam which quite arguably doesn't work particularly well, puts a gross obligation on many others, and which causes email to become less reliable and silently broken, is pretty unambiguously bad.

That in a nutshell is C/R.

Selectively used blacklists work. Locally maintained whitelists work. Expedited processing of known good SMTP servers works. Teergrubing and firewalling suspect or known bad hosts, networks, ISPs, or ASNs works. Content / context based filtering, whether adaptive (Bayesian) or rules-based, works. The combined approach deals with large volumes of spam pretty well.

On the other hand, I should point out that if my email client at least attempted to keep track of what messages I've sent, and used the In-Reply-To header as an effective whitelist, and assumed that anything that looks like a reply or especially a challenge or bounce that isn't in reply to any specific message I actually sent is almost certainly spam, this would help the spam filtering problem on my end without anyone else having to deploy any additional software.

Extend this to a specific tool that can be used by all major mailers. Then talk to me.

You still haven't demonstrated why my dealing with your spam is my problem.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

virus != spam (none / 1) (#45)
by polyglot on Mon Feb 02, 2004 at 08:45:39 PM EST

problem is, viruses come from friends (well, depending on the kinds of friends you keep :)

If your friends are in your whitelist, spamassassin won't bin virus emails from them. Not that you're going to catch anything in mutt anyway....
--
"There is no God and Dirac is his prophet"
     -- Wolfgang Pauli
[ Parent ]

.procmailrc before .forward (none / 1) (#39)
by rustv on Mon Feb 02, 2004 at 12:44:46 PM EST

I'd recommend making sure .procmailrc is set up and debugged before pointing .forward at it.  It's possible to bring down a pretty big system if you're not careful.  I know this from experience.  In my case, I didn't have the permissions set correctly on the .procmailrc, and the university computer barfed all over the place.
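A small pre-flight check along these lines might have saved me. It is only a sketch: the function name is made up, and the three checks assume the usual expectation that the rcfile is a regular file, owned by you, and not writable by group or others. Check the procmail man page on your own system for the exact rules.

```python
import os
import stat


def rcfile_problems(path):
    """Return a list of reasons the rcfile looks unsafe (empty if none found)."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]
    problems = []
    if not stat.S_ISREG(st.st_mode):
        problems.append("not a regular file")
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("writable by group or others")
    if st.st_uid != os.getuid():
        problems.append("not owned by the current user")
    return problems
```

Run it against ~/.procmailrc and only create the .forward once it comes back empty and a test message has gone through cleanly.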

____
"Don't tase me, bro." --Andrew Meyer

Spambayes (none / 1) (#47)
by irrevenant on Tue Feb 03, 2004 at 03:01:54 AM EST

Personally, I've had fairly poor results with Spamassassin.  I found Spambayes (http://spambayes.sourceforge.net/) to give much better results.

Warning: Apparently it doesn't work as well for languages other than English...

obligatory bayesian comment (none / 2) (#48)
by martingale on Tue Feb 03, 2004 at 03:26:42 AM EST

I hesitate to use the word "Bayesian", because most filters which call themselves that don't really do Bayesian mathematics properly. However, Bayesian trainable filters are, many of them, ideally suited for a procmail recipe.

A quick search on Freshmeat turns up 19 "Bayesian" filters, unfortunately not in order of relevance or sophistication. There are at least five or six which can be used directly as command line tools, hence can be integrated in a procmail recipe just like SA can: bogofilter, bmf, antispam, annoyance filter, ifile, bayespam, spamprobe, dbacl (all of these are "fast" - if you want slow ones written in perl or python, there are others like spambayes, popfile etc, but they are not necessarily designed for the command line).

Why use them? The ones that are written in C are much faster and lighter weight than SA, the tweaking interface is more user-friendly (instead of choosing rules and score values directly, you choose examples of spam and nonspam mails from your archives), and they adapt much more quickly to changing email streams, without requiring a software update.

Some of the Bayesian filters listed can classify your mail into many more than two categories. Some can estimate the classification accuracy ("unsures").

Most schemes the spammers have come up with to break trainable filters actually backfire and make the spam more easily recognizable. For example, the nonsense words included in emails actually make the mail stand out for a trained filter. If the filter implementation is mathematically sound, then it is impossible to "poison" its database. Instead, the filter picks up the mere fact that an email is "trying to poison" as a clue against it.

Other techniques such as obfuscation also backfire, because either the filter deobfuscates and reads the spammy message, or it fails to deobfuscate and learns that messages carrying obfuscation are spam carriers.

Once in a while, spammers find a place within the email header/body which isn't scanned by the most popular filters (for example, html tags are usually not scanned), and are able to insert their message in those cracks. In that case, the Bayesian filter doesn't "see" it, and is fooled - until the filter writers realize the problem and fix their scanners. This is quite rare now, and is certainly doomed to fail in the long run. That's because, even if an obscure trick can always be found, the rest of the message surrounding it is still not likely to have the same character as legitimate mail for each individual user. So the obscure trick, while hiding the spam, still radiates a stink ;-)
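For readers who have never looked inside one of these filters, here is a toy two-class naive Bayes classifier. It is a sketch only: the class name, the tokenizer and the add-one smoothing are my own choices, and none of the filters listed above is claimed to work exactly this way.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Two-class naive Bayes over lowercase word tokens (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, text, label):
        """Learn from one example message; label is 'spam' or 'ham'."""
        self.counts[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = max(1, len(set(self.counts["spam"]) | set(self.counts["ham"])))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the smoothed prior odds, then add each token's evidence.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so exp() cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Training is just a matter of feeding it examples from your own archives, which is why a nonsense word that only ever shows up in spam ends up counting against the message carrying it.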

I have a nit to pick... (none / 0) (#57)
by warrax on Thu Feb 05, 2004 at 07:00:20 AM EST

The ones that are written in C are much faster and lighter weight than SA, the tweaking interface is more user-friendly (instead of choosing rules and score values directly, you choose examples of spam and nonspam mails from your archives), and they adapt much more quickly to changing email streams, without requiring a software update.

Umm... SA has had Bayesian (yeah, yeah, not really Bayesian, but whatever...) learning for quite some time now. It works just like the other filters, but instead of being the only filtering criterion, the probability of spamminess that the filter assigns is part of the total score of the email, i.e. giving you the best of both worlds.

-- "Guns don't kill people. I kill people."
[ Parent ]

nit back to you ;-) (none / 0) (#58)
by martingale on Thu Feb 05, 2004 at 10:49:09 AM EST

Umm... SA has had Bayesian (yeah, yeah, not really Bayesian, but whatever...) learning for quite some time now. It works just like the other filters, but instead of being the only filtering criterion, the probability of spamminess that the filter assigns is part of the total score of the email, i.e. giving you the best of both worlds.

In some sense, SA has always been a learning filter. You can think of humans adding their own criteria as a crude form of learning. Then they had an elaborate scheme using genetic algorithms to pick good rules. Or maybe I'm confused with another filter. Nope, I'm right.

So far, so good. All the Bayesian filters I listed also integrate a large number of rules. The only difference is that each rule is much simpler. Instead of features such as "regular expression x in the From: field", the features are "word x somewhere in the body". Still, that's not what I'm driving towards.

If SA manages a set of features by assigning good weights to each feature, then the same is true of Bayesian filters. Except that the SA people choose their weights to be good for a large cross section of people, while the Bayesian filters choose their weights for a single individual.

The trick for SA is to get their Bayesian weights to work well with the other weights, in a consistent way. I'm not convinced they do that. If the two sets of weights don't jibe well together, then you'll get some rules which are often or always overshadowed by others. Some rules may in fact weaken the power of other rules, if they are correlated in the wrong way.

Pure Bayesian systems don't have that problem because generally the individual features are selected in an unbiased way, relative to the algorithm used.

So I'm not saying SA's Bayesian component is worse than others, just that I'm not convinced combining it with their other rules is necessarily a good idea. The speed issues are definitely real, though.

[ Parent ]

Umm... (none / 1) (#61)
by warrax on Fri Feb 06, 2004 at 05:45:22 AM EST

All I'm saying is that you can turn off all the other things that they filter on (by default) and use your own ham/spam database for their Bayesian classifier (man sa-learn). That gives you the same behavior as a "naive" Bayesian filter. I'm not talking about the default setup. That setup is clearly not the same as a "naive" Bayesian filter; whether it's better or not is left as an exercise for the interested reader. :)

-- "Guns don't kill people. I kill people."
[ Parent ]

SpamAssassin Bayesian before Scoring (none / 1) (#50)
by dcturner on Tue Feb 03, 2004 at 03:49:32 AM EST

Can anyone explain why SpamAssassin does a variety of fancy Bayesian tests on a message, then assigns some fairly arbitrary number of points based on that and a load of static heuristic criteria?

Surely the correct way to do this would be to take the heuristics and put them in the Bayes database so in my installation, for example FROM_ENDS_IN_NUMS gets near-zero probability (the username format at my site is roughly [a-z]{2,}[0-9]+) but other tests like faked MUAs end up with very high probabilities. All this should be done dynamically, automatically and meaningfully. I've no idea what to set my threshold at, because a spam score of 5 means nothing. I'd like to set it at a probability level of 0.99 or so. That'd make more sense, surely?
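Something like the following is what I have in mind. It is only a sketch: the function names, the add-one smoothing, the equal-priors assumption and the hit counts in the usage note are all mine, and the combination formula is the independence-assuming one popularised by Paul Graham, not anything SpamAssassin does.

```python
def rule_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | rule fired) from per-site counts.

    Uses add-one smoothing and assumes equal priors for spam and ham.
    """
    spam_rate = (spam_hits + 1) / (n_spam + 2)
    ham_rate = (ham_hits + 1) / (n_ham + 2)
    return spam_rate / (spam_rate + ham_rate)


def combine(probabilities):
    """Combine per-rule probabilities, treating the rules as independent."""
    spam = ham = 1.0
    for p in probabilities:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


THRESHOLD = 0.99  # a probability, rather than an arbitrary point total
```

With my corpus of 1700 spam and 900 ham, and made-up hit counts, a rule that fires on 400 spam and 850 ham comes out well below 0.5, while one that fires on 600 spam and no ham comes out above 0.99. The weights fall out of the local mail rather than being fixed by hand.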

Mind you, it still works for me, trained on 1700 spam and 900 ham I have had no false positives and only a handful of false negatives. I suppose if it works, I shouldn't complain. It just seems an inelegant solution.

Remove the opinion on spam to reply.


Fusing Bayes and scored-rules (none / 0) (#55)
by PigleT on Tue Feb 03, 2004 at 11:49:42 AM EST

Well, Bayes implementations tend not to bother with heuristics such as "From ends in numbers"; the general rule seems to be that if this factoid is true, it'll be reflected in the Bayes database. (Have a look through it with _strings_, some time - see if you can work out new rules?)

The point is that SA works by linear combination of scores for various rules, and Bayes, as a complete code module, is a rule all of its own (well, several, for the various percentage-ranges) whose usefulness values you have to gauge for yourself. If you spent long enough, I don't doubt you could find some way to encompass SA's other heuristics *somehow* into the bayes database, but by that time you'd be bored, and you wouldn't be able to tell Bayes' contribution to the score apart from everything else.
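In sketch form, the linear combination looks roughly like this. The scores, the band cut-offs and the HTML_ONLY rule name are invented for illustration; the band names merely imitate SA's BAYES_nn style, and real scores differ between releases.

```python
# Hypothetical static rule scores.
RULE_SCORES = {
    "FROM_ENDS_IN_NUMS": 0.9,
    "HTML_ONLY": 1.2,  # made-up rule name
}

# Bayes enters as a handful of rules, one per probability band.
# (floor, name, score), highest band first; all values are made up.
BAYES_BANDS = [
    (0.99, "BAYES_99", 3.5),
    (0.80, "BAYES_80", 2.0),
    (0.50, "BAYES_50", 0.5),
    (0.00, "BAYES_00", -2.5),
]

THRESHOLD = 5.0  # the "spam score of 5" from the parent comment


def bayes_rule(probability):
    """Map a Bayes probability onto the single band rule that fires."""
    for floor, name, score in BAYES_BANDS:
        if probability >= floor:
            return name, score
    raise ValueError("probability must be between 0 and 1")


def total_score(hits, bayes_probability):
    """Sum the scores of the static rules that hit, plus the Bayes band."""
    _, bayes_score = bayes_rule(bayes_probability)
    return sum(RULE_SCORES[h] for h in hits) + bayes_score
```

Because Bayes shows up as its own named rule, you can always see how much of the total it contributed, which is exactly what you would lose by folding everything into one database.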

In fact, if you want a pure Bayes-only solution (as Paul Graham might be read as advocating), you probably want something simpler, maybe like Bogofilter.

~Tim -- We stood in the moonlight and the river flowed
[ Parent ]

Update to Article - using spamd (none / 1) (#62)
by mbreyno on Fri Feb 06, 2004 at 03:01:48 PM EST

After I wrote this article, I quickly learned that I had not set up SpamAssassin in the most efficient manner. I set this up on a server with 256MB RAM and 200 mail users. SpamAssassin quickly brought the server to its knees since it was being invoked for every message for every user. Ack! Instead, I learned to use spamd and spamc. See: For my revised article.

SpinWeb: Intelligent Internet Software
ZenBox: Open Source Alternative Health

Maildir changes (none / 0) (#63)
by icebike on Sat Feb 07, 2004 at 08:42:53 PM EST

I found this most useful, as I have been looking for a way to integrate sa-learn in a system-wide installation. But I use maildir format (as do most newer installations), and with a slight modification to the commands I was able to accommodate this format:

system("$SA_LEARN --ham $user_notspam_folder/cur/*");

Summary: remove the --mbox parameter and add a slash asterisk at the end. Needed in two places. Then replace the cat of /dev/null with an rm of all files in that folder.

Email filtering with Procmail + SpamAssassin + ClamAV | 65 comments (54 topical, 11 editorial, 2 hidden)